
UNIVERSITY OF MICHIGAN – EECS 470 WINTER 2015 FINAL PROJECT GROUP 13

A Three-way Superscalar R10K Microprocessor with Advanced Features Group 13: Tan Bie, Yang Jiao, Du Lyu, Mengjiao Xu, Zhixun Zhao

{tanbie, jiaoyang, ldytmac, mjx, zzhixun} @umich.edu

Abstract This report presents a three-way superscalar R10K out-of-order microprocessor. The baseline pipeline is composed of five stages: Instruction Fetch, Instruction Decode, Dispatch, Execution, and Write-back. To improve the performance of the three-way superscalar microprocessor, several advanced features are implemented and evaluated, including an Instruction Cache with Prefetching, a Non-blocking Data Cache with Miss Status Holding Registers (MSHR), a Load-Store Queue (LSQ), a Local Branch Predictor, a Return Address Stack (RAS), and a GUI Debugger. The overall clock period for the microprocessor is 11.5 ns. Keywords Three-way Superscalar, R10K, Tomasulo’s Algorithm, Out-of-Order

I. Introduction

Pipelined microprocessors are the dominant design in the market, and a growing number of advanced features have been added to the original pipelined structure to reduce CPI or shorten the clock period. In our final project for Computer Architecture, our team set out to improve the basic Tomasulo’s algorithm by implementing a three-way superscalar out-of-order R10K microprocessor. In addition, several advanced features are added to improve its performance, including an Instruction Cache with Prefetching, a Return Address Stack, a Local Branch Predictor, a Load-Store Queue, and a Non-blocking Data Cache with MSHR. In this report, the design and implementation of the baseline components are detailed and the basic structure of the pipeline is presented in Section II. Section III presents the advanced features, with detailed and comprehensive design and analysis. Section IV evaluates the advanced features and their effects on the microprocessor. Finally, in Section V, we evaluate our overall performance and discuss possible future improvements to our design. The specification of each module is detailed in Table 1:

Table 1 Design Specifications

I-Cache: 32 lines of 8 bytes; 4-way set-associative; write-back, write-allocate; 2-bit true LRU; 4 prefetch lines

Branch Predictor: BHT: 16-entry; PHT: 8-entry; BTB: 16-entry; RAS: 16-entry

Baseline Features: PRF: 64-entry; RAT/RRAT: 32-entry; RS: 16-entry; ROB: 32-entry

Functional Units: 3 ALU; 2 MULT

Load-Store Queue: Load Queue: 8-entry; Store Queue: 8-entry; one data-forwarding port; two load data ports to d-cache; one store data port to d-cache

D-Cache: two load ports; one store port; MSHR: 8-entry

An overview of the microprocessor design is shown in Figure 1.


Figure 1 Design Overview

II. Baseline Components

The pipeline of the microprocessor consists of five stages: instruction fetch, instruction decode, dispatch, execution, and write-back. The baseline components within the five-stage pipeline are discussed next.

1. Instruction Fetch Stage
The instruction fetch unit fetches instructions. Since our design is 3-way superscalar, two cache lines are fetched at the same time. If the first cache line misses, the processor stalls until it is loaded from memory. We use the branch predictor and the return address stack to help predict the next instruction to fetch in the IF stage. The fetch of next_pc follows the priority: rob_target_pc > return address stack pc > branch predictor pc > normal fetch.

2. Instruction Decode Stage
2.1 Instruction Decoder
The instruction decoder unit decodes instructions. The decoder for our 3-way superscalar design is implemented by simply replicating the provided decoder.

2.2 Register Alias Table/Retire Register Alias Table (RAT/RRAT)
The Register Alias Table (RAT) keeps track of the latest mapping for each architectural register as instructions are issued. In our design, we use renaming to avoid false dependencies, and the RAT


is the module where we keep the renaming list. Every time a true dependency is detected, the RAT searches the Free List for a physical register that can be used for renaming and updates itself with the new mapping. There are 32 entries in the RAT/RRAT and 64 entries in the free list. When a misprediction happens, a signal from the RoB informs the RRAT, and the RRAT contents are copied back into the RAT to recover the correct state.

2.3 Physical Register File (PRF)
There are 64 entries in the PRF, which is three-way in and three-way out. Internal forwarding is used within the PRF so that data can be read in the same clock cycle it is written.

3. Dispatch Stage
3.1 Reservation Station (RS)
The Reservation Station (RS) is the key component that enables out-of-order execution of instructions. Instructions are dispatched into the RS and wait for their operands to become ready. The RS communicates with the PRF and CDB to obtain valid values and issues ready instructions to the execution stage out of order. In our design the RS has 16 entries in total. A 16-to-3 priority selector chooses among the free RS entries to receive incoming instructions. To simplify the control logic, we monitor the number of free RS entries and dispatch up to three instructions at a time only if at least three entries are free. Another 16-to-3 priority selector selects and issues up to three "woken-up" instructions. We tried issuing the oldest instructions first to see whether it would give a performance boost, but the improvement was not significant, so no special issue priority is employed in our design.

3.2 Reorder Buffer (ROB)
The reorder buffer ensures the in-order retirement of instructions issued out of order. The buffer is circular, with 32 entries storing valid instructions, to achieve first-in, first-out buffering. ‘Head’ and ‘Tail’ pointers implement the circular buffering and entry flushing, with valid instructions stored between ‘Head’ and ‘Tail’. In accordance with the 3-way dispatch of our processor, when fewer than three entries in the RoB are vacant, a signal tells the pipeline to stall and wait for more entries to become available. To retire instructions, completion signals come back to the RoB from the CDB, the LSQ, and the execution stage: the CDB reports the completion of general operations and loads (only the PRF index, not the actual data, is returned to the RoB), the LSQ reports stores, and the execution stage reports branches. Entries in the RoB can only retire in order; if a branch is mispredicted, the entries behind the branch are flushed and ‘Head’ and ‘Tail’ are moved to the entry after the branch. In addition, during a flush, the RoB sends signals to the RS and RRAT to inform them of the squashed entries.

4. Execution Stage
Each cycle, when the reservation station issues up to three instructions, it sends them to the execution stage. This stage has three kinds of sub-modules: multipliers, ALUs, and a branch unit that decides whether a branch should be taken and calculates the branch target address. The multiplier takes four cycles to complete one multiplication and continuously signals the RS whether it is free. The ALUs compute normal ALU operations and send the results to the CDB, and compute load/store addresses and send them to the LSQ. For branches, the execution stage compares the computed target address with the address predicted by the branch predictor. The execution stage synthesizes within a clock period of 8 ns.

5. Write Back Stage
The Common Data Bus (CDB) implements the write-back function of the microprocessor and broadcasts data to the relevant modules. Because of the limitation of the three-way superscalar microprocessor,


we can only have three CDB lanes. There are more than three sources providing data to the CDB, so we need buffers alongside the three-way CDB. First, we set the priority for data broadcast on the CDB: load data has the highest priority, multiplication results come second, and other operations last. Load data is broadcast first; if some other operation’s result is ready but no CDB lane is available, it is stored in a buffer. In the next clock cycle, the buffered data has the highest priority and is broadcast on the CDB after a one-cycle delay.
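The arbitration above (buffered results first, then loads, then multiplies, then ALU results, with the overflow deferred one cycle) can be sketched as follows. This is a minimal behavioral model for illustration, not the actual hardware; the function name and list-of-strings representation are assumptions.

```python
from collections import deque

CDB_WIDTH = 3  # three-way superscalar => three CDB lanes


def arbitrate_cdb(buffer, loads, mults, others):
    """One cycle of CDB arbitration (simplified sketch).

    `buffer` is a deque of results deferred from earlier cycles; it has
    the highest priority, followed by loads, then multiplier results,
    then the remaining ALU results. Anything that does not fit in the
    three lanes is appended to `buffer` for the next cycle.
    """
    candidates = list(buffer) + loads + mults + others
    buffer.clear()
    broadcast = candidates[:CDB_WIDTH]   # at most three results this cycle
    buffer.extend(candidates[CDB_WIDTH:])  # defer the rest
    return broadcast
```

With two loads, one multiply, and one ALU result arriving in the same cycle, the ALU result is deferred and wins arbitration the following cycle, matching the one-cycle-delay behavior described above.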

III. Advanced Features

1. Instruction Cache/Prefetching (I-Cache)
Our group implements both a basic 3-way blocking i-cache and one with prefetching. The instruction cache is 256 B, with 8 B lines, so there are 32 lines in total. The i-cache is 4-way set associative, and thus has 8 sets. It follows a write-back, write-allocate policy. For each set, a true 2-bit LRU is implemented. The i-cache takes two addresses (PCs) from the IF stage per clock cycle and returns two data lines containing four instructions, each line with a valid bit. On a cache hit, the data is sent to the IF stage immediately. It counts as a 3-way i-cache because the IF stage chooses three of the four instructions. The difference between the basic i-cache and the prefetching one is that the basic one simply waits on a cache miss until the data arrives from memory, while the prefetching one keeps sending load commands for the follow-up PCs to memory until the buffer is filled with prefetched instructions. Prefetching improves CPI dramatically for certain test cases like btest1 and btest2, and also decreases CPI by up to 20% in most common cases. The finite state machine for prefetching is shown below:

Figure 2 FSM for prefetch

Note that when a branch is mispredicted, the state machine jumps back to ‘initial’ to check whether the target PC misses, so that a request can be sent to memory as soon as possible. In addition, the address and its memory response on a branch misprediction are stored in a separate buffer to accept the memory data, which decreases CPI by up to 10% in most cases. Regarding the choice of prefetch depth, the CPI for different numbers of prefetch lines was measured and is shown in Figure 3; from that illustration, the four-prefetch-line scheme is the best choice. Figure 3 CPI for different prefetch lines
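The follow-up-PC generation during a miss can be sketched as below. This is an illustrative model under stated assumptions (8-byte lines, a prefetch depth of four as chosen above); the function name and the `outstanding` bookkeeping are hypothetical, not the report's actual signal names.

```python
LINE_BYTES = 8       # one cache line holds two 4-byte instructions
PREFETCH_DEPTH = 4   # four prefetch lines performed best per Figure 3


def prefetch_addresses(miss_pc, outstanding):
    """Return the line addresses to request while waiting on a miss.

    `outstanding` is how many prefetch requests are already in flight;
    the remaining slots are filled with the follow-up lines after the
    line containing miss_pc, aligned to the 8-byte line size.
    """
    base = (miss_pc // LINE_BYTES) * LINE_BYTES  # align to line boundary
    return [base + (i + 1) * LINE_BYTES
            for i in range(outstanding, PREFETCH_DEPTH)]
```

Each idle memory cycle during the miss, the fetch FSM would issue requests from this list until the prefetch buffer holds four lines.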

2. Branch Predictor/Branch Target Buffer (BP/BTB)
The branch predictor and BTB predict the next PC to fetch in order to improve performance by decreasing the number of branch mispredictions. In our design we implement a 3-bit local-history predictor together with a BTB. They both


have 16 entries and are indexed by 4 bits of the next PC of the incoming instruction. The branch predictor keeps track of each branch’s history in the branch history table (BHT). Each branch then refers to an entry of the pattern history table (PHT), which uses a 2-bit saturating counter to make the prediction. The branch target buffer records the PC and target PC of branches and, on a hit, predicts the target PC of the incoming branch. The update information for the branch predictor comes from the ROB: when a branch retires from the ROB, its information is sent to update the branch predictor and the branch target buffer. Once an incoming branch hits the BTB and is predicted taken by the PHT, the branch_target_pc is sent to the instruction fetch unit and marked as valid. A return address stack, described next, is added to improve prediction for branches such as jsr and bsr.

3. Return Address Stack (RAS)
The return address stack strengthens prediction for branches such as bsr and jsr. The BTB struggles with returns because a function can be called from different locations in the program, so the return address changes. With a return address stack, when there is a function call, the return address is pushed onto the stack; when the function returns, the address is popped off the stack. The push and pop operations are controlled by a TOP pointer that can wrap around, giving the RAS last-in, first-out (LIFO) behavior, which matches the structure of function calls and returns and increases the hit rate. There are two options for updating the RAS: update it with incoming call and return instructions in the fetch stage, or update it

with retired call and return instructions. Both options have pros and cons; in our design we applied the first option. The RAS structure we implemented is shown in Figure 4. Figure 4 RAS Structure
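The circular LIFO behavior with a wrapping TOP pointer can be sketched as a small model. This is a behavioral illustration of the mechanism described above, sized to the 16 entries in Table 1; the class and method names are hypothetical.

```python
RAS_ENTRIES = 16  # matches the 16-entry RAS in Table 1


class ReturnAddressStack:
    """Circular LIFO stack with a wrapping TOP pointer (sketch).

    On a call (bsr/jsr) the return address is pushed; on a return it is
    popped. When the stack overflows, the oldest entry is silently
    overwritten, the usual behavior for a hardware RAS.
    """

    def __init__(self):
        self.stack = [0] * RAS_ENTRIES
        self.top = 0  # index of the next free slot

    def push(self, return_pc):
        self.stack[self.top] = return_pc
        self.top = (self.top + 1) % RAS_ENTRIES  # wrap around

    def pop(self):
        self.top = (self.top - 1) % RAS_ENTRIES  # wrap around
        return self.stack[self.top]
```

Nested calls push PC+4 in order and pops return the addresses in reverse, which is exactly the pattern of call/return that makes the RAS effective.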

4. Load-Store Queue (LSQ)
The structure of the LSQ is shown in Figure 5. The load-store queue (LSQ) in our processor has a split structure with an 8-entry load queue and an 8-entry store queue. The store queue is like a small RoB, with instructions issuing in order, and the load queue is like a small RS, with instructions issuing out of order. Once instructions are dispatched from the decode stage, load and store operations are stored in the LSQ; a simple instruction distribution module distributes them between the two queues, and a similar address distribution module distributes addresses. An ‘age comparison & dependency’ module compares the dependency between loads and stores.

The age of a load/store instruction is expressed as its RoB index minus the RoB head; unlike the everyday notion of age, a smaller age here means the instruction is older. There are four states of


Table 2 LSQ dependency state

dependency between a load and the stores ahead of it: ‘UNKNOWN’, ‘NOT-DEPENDENT’, ‘DEPENDENT-FORWARD’ and ‘DEPENDENT-WAIT’; the relationship between the four states

is shown in Table 2. There are two ports for loads from the LSQ to the D-Cache and one port for stores. To ensure the correctness of stores, a store can only issue when it reaches the head of the RoB,

State: UNKNOWN
Description: There is still a store operation in the store queue with a smaller age than the load, and that store’s address is still unknown. In this situation we are not sure whether the load can forward from the earlier store or go to the D-Cache, so the load must stay in the load queue, waiting for the stores ahead to be resolved.

State: DEPENDENT-WAIT
Description: Scanning the store queue, the load finds a store with the same address, and that store is the ‘youngest’ one with the same address, so the load can forward data from it without going to the D-Cache. However, the data for that store is still unavailable, so the load must wait in the load queue for the data to forward.

State: DEPENDENT-FORWARD
Description: If the store data described above is available, the load is in this state and forwards the data from the store onto the CDB.

State: NOT-DEPENDENT
Description: Scanning the store queue, there is no store with the same address as the load, so the load cannot forward any data from the store queue.
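The age computation and the store-queue scan that produce these four states can be sketched as below. This is a behavioral illustration only; the dict-based entry layout and function names are assumptions, and the real design performs this comparison in hardware.

```python
ROB_SIZE = 32  # matches the 32-entry ROB in Table 1


def age(rob_index, rob_head):
    """Age of an instruction: its distance from the RoB head.

    Smaller age means older, as defined in the text above.
    """
    return (rob_index - rob_head) % ROB_SIZE


def classify_load(load, store_queue, rob_head):
    """Return the dependency state of `load` against the store queue.

    Each store is a dict with 'rob_index', 'addr' (None if unresolved),
    and 'data' (None if not yet produced).
    """
    load_age = age(load['rob_index'], rob_head)
    older = [s for s in store_queue
             if age(s['rob_index'], rob_head) < load_age]
    if any(s['addr'] is None for s in older):
        return 'UNKNOWN'            # an older store address is unresolved
    matches = [s for s in older if s['addr'] == load['addr']]
    if not matches:
        return 'NOT-DEPENDENT'      # safe to go to the d-cache
    # forward from the youngest (largest-age) matching older store
    youngest = max(matches, key=lambda s: age(s['rob_index'], rob_head))
    if youngest['data'] is None:
        return 'DEPENDENT-WAIT'     # wait for the store's data
    return 'DEPENDENT-FORWARD'      # forward the store's data to the CDB
```

Note the conservative ordering: any unresolved older store address forces UNKNOWN before the address match is even attempted, which is exactly why the aggressive variant discussed in Section V could help.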

Figure 5 LSQ structure


that’s why the structure of the store queue is quite similar to the RoB and it must issue in order. In contrast, once a load in the load queue is in the NOT-DEPENDENT state, it can be issued from the load queue, so loads issue out of order.

5. Data Cache/Miss Status Holding Registers
Our group implements both a basic 3-way d-cache and a non-blocking d-cache with miss status holding registers (MSHR). The basic settings for the i-cache and d-cache memories are the same; the difference is that the d-cache also has to handle memory store commands. With the write-back, write-allocate policy, a cache load and a cache store can be handled in the same clock cycle. The basic d-cache can load one entry and store one entry simultaneously in one clock cycle. The d-cache in our design has higher priority than the i-cache: when the d-cache-to-memory command is not BUS_NONE, the memory is connected to the d-cache, and only when the d-cache has no request will the i-cache load instructions. When a dirty entry is evicted, whether triggered by a memory load or a store, a flag indicating "evict is busy" is raised and sent to the pipeline to delay the i-cache’s use of memory; this flag also switches the memory connection to the d-cache. Figure 6 MSHR entry structure

For the non-blocking d-cache, to meet the requirements of the LSQ, the d-cache can load two entries and store one entry in one clock cycle; in fact, it can load as many entries as the LSQ needs. The key to the non-blocking d-cache is the MSHR (Figure 6), which holds missed loads within the d-cache. The MSHR is structured like an 8-entry RoB with one ‘Tail’ pointer and two ‘Head’ pointers. When loads are issued from the LSQ, they enter the d-cache to fetch data. On a hit, the data, along with instruction information such as the PRF index and RoB index, is broadcast on the CDB through the MSHR. On a miss, the load is stored in an MSHR entry. The buffering mechanism of the MSHR is quite similar to that of the ROB, with ‘MSHR_head’ and ‘MSHR_tail’; the other head pointer, named ‘mem_head’, assists the MSHR in sending loads to memory and storing responses from memory. Initially, ‘mem_head’ and ‘MSHR_head’ both point to entry 0. Once loads are stored in MSHR entries, the MSHR starts sending loads to memory between ‘MSHR_head’ and ‘mem_head’. When a response is received from memory, it is stored in the entry ‘mem_head’ currently points to, and ‘mem_head’ moves to the next entry. When a tag arrives with data from memory, the tag is compared with the responses stored in the MSHR; on a match, the data and load information are broadcast on the CDB and the corresponding MSHR entry is freed. Due to the restriction of the 3-way microprocessor, at most three loads can be broadcast on the CDB in one clock cycle: load forwarding from the LSQ takes one lane, and loads from the d-cache take the others. Each cycle, the MSHR can complete one missed load from memory; if, in the same cycle, two loads from the LSQ both hit, there would be three loads coming out of the d-cache, which exceeds the maximum. To solve this, if a missed load completes from the MSHR, only one load can be issued from the LSQ in the same cycle; in addition, if the MSHR is full, a signal is sent to the LSQ to stop issuing loads to the d-cache. The ‘mispredict’ signal from the RoB also controls the MSHR: on a misprediction, entries within the MSHR are squashed to avoid exceptions from loads.

Compared to the blocking d-cache, the non-blocking d-cache with MSHR greatly improves the data cache’s ability to handle load and store instructions. On a load miss, we do not need to wait out the long memory latency before accepting new loads; instead, multiple loads can be handled at the same time within the MSHR.
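The three-pointer bookkeeping described above can be sketched as a small behavioral model. This is a simplification under stated assumptions: completion is modeled in allocation order, and the class, field, and method names are hypothetical rather than the report's actual signal names.

```python
MSHR_SIZE = 8  # matches the 8-entry MSHR in Table 1


class MSHR:
    """Sketch of an 8-entry MSHR with tail, mem_head, and head pointers.

    tail     -> where the next missed load is allocated
    mem_head -> next entry whose request goes to memory / awaits a
                response tag
    head     -> next entry expected to complete and broadcast on the CDB
    All three pointers wrap around, as in the RoB.
    """

    def __init__(self):
        self.entries = [None] * MSHR_SIZE
        self.head = self.mem_head = self.tail = 0
        self.count = 0

    def full(self):
        # when full, the LSQ is told to stop issuing loads
        return self.count == MSHR_SIZE

    def allocate(self, load):
        assert not self.full()
        self.entries[self.tail] = {'load': load, 'tag': None}
        self.tail = (self.tail + 1) % MSHR_SIZE
        self.count += 1

    def record_response(self, tag):
        # memory accepted the request at mem_head; remember its tag
        self.entries[self.mem_head]['tag'] = tag
        self.mem_head = (self.mem_head + 1) % MSHR_SIZE

    def complete(self, tag, data):
        # data returned from memory: match the tag, free the entry,
        # and hand back what would be broadcast on the CDB
        entry = self.entries[self.head]
        assert entry['tag'] == tag
        self.entries[self.head] = None
        self.head = (self.head + 1) % MSHR_SIZE
        self.count -= 1
        return (entry['load'], data)
```

The squash-on-mispredict behavior would simply reset all three pointers and the count, discarding outstanding misses.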


IV. Analysis and Performance

1. Overall performance
Figure 7 shows the improvement in performance in terms of Time per Instruction (TPI). Comparing the design from Project 3 of EECS 470 with the one in our project, the improvement is obvious for all of the given test cases except objsort, which has a negligible performance drop of 0.21%. Therefore, our design greatly improves performance. Figure 7 TPI & Improvement

The clock period we submitted with the project code is 13 ns. After later synthesis, we found that the optimal clock period is 11.5 ns, so 11.5 ns is the clock period used in the figure above. For each individual module, the minimum clock period is under 9 ns. We initially found the 13 ns clock period unacceptable, since it is much longer than we had expected; synthesis revealed the reason. When designing the baseline, we tried to make each individual module as fast as possible and enabled data forwarding in several modules. This backfired: the critical path started at the ROB and ran through the LSQ, d-cache, d-cache memory, d-cache again, MSHR, and LSQ before ending at the CDB, and the clock period at that point was close to 20 ns. We cut this long chain by disabling some data forwarding and adding a pipeline stage where forwarding was hard to disable. By the deadline, we managed to shrink the clock period to 13 ns. Since there are still two or three promising places to further shrink the clock period, the optimal clock period of our design should be under 10 ns. We are certainly not satisfied with our clock period; the cause was failing to foresee the backfire of pushing each individual module to be as fast as possible.
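For reference, the TPI metric used in Figure 7 is simply the clock period multiplied by CPI, so the re-synthesized 11.5 ns clock improves TPI proportionally at any fixed CPI. A minimal sketch (the CPI value 2.0 below is an arbitrary example, not a measured number):

```python
def tpi(clock_ns, cpi):
    """Time per instruction in nanoseconds: clock period times CPI."""
    return clock_ns * cpi


def improvement_pct(old_tpi, new_tpi):
    """Relative TPI improvement, as a percentage."""
    return (old_tpi - new_tpi) / old_tpi * 100.0
```

At an illustrative CPI of 2.0, moving from the 13 ns to the 11.5 ns clock cuts TPI from 26 ns to 23 ns, about an 11.5% improvement.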

Figure 8 P3, Baseline & Final CPI Comparison


Figure 9 Blocking I-Cache and Prefetch (CPI)

2. Analysis for prefetch i-cache
A prefetching i-cache makes better use of the clock cycles during which a cache miss is outstanding and the IF stage is waiting on memory. Compared to the blocking i-cache, the prefetching i-cache sends load commands for the PCs following the current miss PC to memory during cycles the blocking i-cache would leave idle, which in theory reduces CPI. The figure shows that the CPI is reduced for all test cases; for test cases with loops, like btest1 and btest2, the CPI is reduced dramatically. (Figure 9)

3. Analysis for non-blocking d-cache
When the LSQ sends a load to a blocking d-cache and a cache miss happens, the LSQ must hold the input address and cannot send any other load or store instruction until the missed data is loaded. That is, the blocking d-cache can only support one port for both load and store instructions. The non-blocking d-cache, however, is not stalled by a cache miss, thanks to the miss status holding registers (MSHR). When a cache miss happens, the MSHR can store the memory response

Figure 10 Blocking and Non-blocking D-Cache Analysis (CPI and hit rate)


of the corresponding load address. This mechanism lets the d-cache send more loads to memory even while a miss is outstanding, which improves performance on test cases with many load misses. From the figure we can see that if a test case has few loads and stores, there is a small CPI penalty, while if it has many loads and stores, there is a distinct hit-rate increase and CPI decrease. The non-blocking d-cache can support one store and as many loads as the LSQ wants in one clock cycle; in our design we use two load ports, which is considered enough. Overall, performance is clearly improved by the non-blocking d-cache. There is a special case, the saxpy test: the hit rate does not improve, but the CPI is reduced greatly. The reason is that the multi-port non-blocking d-cache increases the chances of data forwarding from stores to loads; this result was not expected and was discovered from the testing results. The reason a hit rate can reach 100% is that the d-cache uses a write-back, write-allocate policy: if a test case stores to an address, the line goes into the cache first, and later loads from that address hit.

4. Analysis for Branch Predictor
We compared the CPI of our design with the design without a branch predictor. The prediction hit rate was also calculated and is shown in Figure 11. The figures show clearly that for most cases our predictor achieves a satisfactory prediction hit rate and thus helps boost performance. However, for some cases like btest1, btest2, and evens, the CPI does not change at all. Looking into the hit-rate graph, we find that for these programs the branch predictor did not improve prediction accuracy. This is because these programs do not have repeated patterns that provide enough information to predict, or because of the limits of a local-history predictor, which makes predictions based only on history patterns, limiting the performance boost.
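The local predictor's behavior, including why it needs repeated patterns, can be sketched with a 3-bit-history BHT feeding an 8-entry PHT of 2-bit saturating counters, as described in Section III. This is a behavioral illustration; indexing by `pc % 16` stands in for the "4 bits of the next PC" used in the real design, and the class name is hypothetical.

```python
BHT_ENTRIES = 16  # per-branch 3-bit local histories
PHT_ENTRIES = 8   # one 2-bit saturating counter per history pattern


class LocalPredictor:
    """Sketch of the 3-bit local-history predictor with a 2-bit PHT."""

    def __init__(self):
        self.bht = [0] * BHT_ENTRIES   # 3-bit histories, initially 0
        self.pht = [1] * PHT_ENTRIES   # 2-bit counters, weakly not-taken

    def _pht_index(self, pc):
        # the branch's local history selects the PHT counter
        return self.bht[pc % BHT_ENTRIES] & (PHT_ENTRIES - 1)

    def predict(self, pc):
        return self.pht[self._pht_index(pc)] >= 2  # taken if counter >= 2

    def update(self, pc, taken):
        idx = self._pht_index(pc)
        # saturate the 2-bit counter, then shift in the new outcome
        if taken:
            self.pht[idx] = min(3, self.pht[idx] + 1)
        else:
            self.pht[idx] = max(0, self.pht[idx] - 1)
        b = pc % BHT_ENTRIES
        self.bht[b] = ((self.bht[b] << 1) | int(taken)) & 0b111
```

A branch that is always taken trains the all-ones history pattern to a strongly-taken counter after a few iterations, while a branch with no repeating pattern keeps bouncing between counters, which matches the btest1/btest2/evens behavior observed above.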

Figure 11 Branch Predictor Hit Rate


Figure 12 CPI Comparison

5. Analysis for RAS
The performance impact of the return address stack was then analysed and is plotted in Figure 13. The return address stack only affects test programs with function calls and returns, namely fib_rec and objsort. From Figure 13 we can see that the CPI of fib_rec drops by 10.6% with the help of the return address stack: in fib_rec, a bsr instruction repeatedly calls the fib function and returns, and on each return the RAS pops the correct return address, greatly decreasing the cycle penalty of branch misprediction. However, for objsort the CPI unexpectedly gets worse. Possible reasons: (1) we store PC+4 as the return address when implementing the RAS, but objsort contains return instructions, such as ret $r1, that return to the address stored in register 1 and cannot be predicted correctly; (2) this case has intensive function calls and branch mispredictions, and a more sophisticated recovery mechanism for mispredictions would be needed to improve RAS prediction accuracy.

Figure 13 CPI Comparison of design with and without RAS

6. Analysis for LSQ
Data forwarding in the LSQ saves the microprocessor memory latency, but it also has a side effect: a load has to wait in the load queue until its dependency is resolved. Among our test cases, some have no load instructions (for example btest1.s, btest2.s, and evens.s), so data forwarding brings no improvement there. Some test cases offer no forwarding opportunity because the load address is


different from the stores ahead (for example saxpy.s and parsort.s). For copy.s, fib.s, and objsort.s, data forwarding from the LSQ does help the microprocessor decrease CPI, and the effect is illustrated in Figure 14. As we can see, all of the loads in copy can be forwarded, so its CPI decreases due to data forwarding. Likewise, 50% of the loads in fib and 7% of the loads in objsort can be forwarded, so the CPI of those two test cases decreases to some degree. Figure 14 LSQ-Data Forwarding

7. GUI Debugger
We implemented a GUI debugger based on the given template; it proved a powerful tool, especially during high-level debugging.

V. Discussion
Though our design achieves satisfactory results for most test cases, it still suffers performance penalties from branch mispredictions, and it could be improved with more advanced features. Several possible future improvements are proposed and discussed below:

1. A better branch predictor
The benefit of a better branch predictor is large, since it increases branch prediction accuracy. Because every misprediction causes a squash with a heavy penalty, a better predictor should reduce squashes in the front-end stages and improve CPI considerably. Combined predictors using local and global history could be adopted; other predictor types such as G-share, global and

bimodal can be used for comparison and analysis.

2. Try other block sizes for the i-cache and d-cache
The current block size for both the i-cache and d-cache is 8 bytes, the same size as one entry read from memory, so the current caches do not exploit spatial locality at all. If the block size were raised to 16 or 32 bytes, there could be a potential performance improvement for the microprocessor.

3. Add an early branch resolution scheme
Early branch resolution can reduce the cycle penalty of branch mispredictions. It can be expected that, with an early branch resolution scheme added, the CPI of programs with many branches would improve.

4. Add a more aggressive LSQ
Our current LSQ is relatively conservative: a load in the load queue can only issue once its dependency status is resolved. A more aggressive LSQ would send a load to the D-cache for data even while its dependency status is still UNKNOWN. If the load's dependency status later turns out to be DEPENDENT, we must invalidate the data from the D-cache/memory and forward the data from the store instead. Such a scheme saves CPI when the UNKNOWN load turns out to be independent, since we avoid waiting in the load queue for the dependency update; if the prediction is wrong and the load is actually DEPENDENT, we waste no extra cycles in practice. Thus, in theory, CPI decreases with an aggressive LSQ. With respect to clock period, however, an aggressive LSQ increases the complexity of the LSQ and would lengthen the clock period. At the moment the LSQ lies on the synthesis critical path, so adopting an aggressive LSQ would hurt our design's clock period. Besides, an aggressive LSQ places a heavier burden on the D-cache as more load operations arrive, and the MSHR buffer may not be large enough to hold a large number of load misses.
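As a concrete reference point for item 1, the 2-bit saturating-counter core shared by the bimodal and local schemes can be sketched as follows (the table size and PC indexing are illustrative assumptions, not our design's parameters):

```python
class BimodalPredictor:
    """Per-PC 2-bit saturating counters: 0-1 predict not-taken, 2-3 taken."""

    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.table = [1] * (1 << bits)    # initialize weakly not-taken

    def _index(self, pc):
        return (pc >> 2) & self.mask      # drop byte offset, take low bits

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)   # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)   # saturate at 0


bp = BimodalPredictor()
assert bp.predict(0x400) is False   # cold entry: weakly not-taken
bp.update(0x400, True)
bp.update(0x400, True)
assert bp.predict(0x400) is True    # trained to taken
bp.update(0x400, False)
assert bp.predict(0x400) is True    # 2-bit hysteresis survives one not-taken
```

A G-share variant would XOR a global history register into `_index`; a combined predictor would add a chooser table selecting between the local and global components.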


As a result, it is a trade-off between clock period and CPI; in the future we can trial the aggressive LSQ and explore whether it improves the overall performance of the microprocessor.

5. Add a more sophisticated recovery mechanism to the RAS
Our return address stack did not achieve a reasonable prediction hit rate on the test case objsort since it lacks a powerful recovery mechanism for branch mispredictions. Combined with a sophisticated recovery mechanism, the RAS could flush undesired entries and

recover the correct return address sequence, which would certainly boost performance for programs with function calls and returns.

CONTRIBUTIONS

Name         Percentage  Contribution
Tan Bie      20%         RoB, LSQ, Non-blocking D-Cache with MSHR, Testing, Synthesis Optimization
Yang Jiao    20%         RAT/RRAT, PRF, CDB, I-Cache with Pre-fetch, Testing and Debugging
Du Lyu       20%         RS, Branch Predictor, RAS, IF/ID Stage, GUI Debugger, Debugging
Mengjiao Xu  20%         RS, Branch Predictor/BTB, EX Stage, Pipeline Integration and Debugging
Zhixun Zhao  20%         Blocking I-Cache, Blocking D-Cache, Non-blocking D-Cache, Pipeline Correctness and Performance Testing

REFERENCES
[1] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th ed. San Francisco, CA: Morgan Kaufmann Publishers.
[2] M. Brehob, "R10K Scheme," EECS 470: Computer Architecture, University of Michigan, Ann Arbor, MI, Feb. 2012.
[3] M. Brehob, "Cache," EECS 470: Computer Architecture, University of Michigan, Ann Arbor, MI, Mar. 2012.
[4] Alpha Architecture Handbook, 4th ed. Houston, TX: Compaq Computer Corp., pp. 4-1 – 4-61.
[5] S. McFarling, "Combining Branch Predictors," WRL Technical Note TN-36, June 1993.
[6] K. Skadron, "Improving Prediction for Procedure Returns with Return-Address-Stack Repair Mechanisms," Princeton University, July 2001.


APPENDIX

1. Comparison of performance

Test case       P3 CPI   Our CPI   P3 TPI (ns)   Our TPI (ns)   TPI Improvement
btest1          1.86     2.44      55.81         28.07          49.70%
btest2          1.85     2.38      55.65         27.42          50.72%
copy            1.75     0.96      52.62         11.06          78.98%
copy_long       1.17     0.58      34.98         6.63           81.06%
evens           2.00     2.79      60.00         32.12          46.47%
evens_long      1.26     0.82      37.74         9.47           74.89%
fib             1.52     0.89      45.51         10.25          77.48%
fib_long        1.14     0.93      34.20         10.64          68.89%
fib_rec         1.99     1.74      59.68         20.02          66.45%
insertion       1.65     1.37      49.62         15.81          68.14%
mult            1.37     1.77      40.98         20.35          50.36%
mult_nolq       1.18     2.08      35.40         23.87          32.57%
objsort         2.62     6.86      78.73         78.89          -0.21%
parallel        1.34     0.74      40.21         8.48           78.92%
parallel_long   1.07     0.54      32.18         6.22           80.68%
partsort        1.35     1.65      40.51         18.96          53.20%
saxpy           1.81     2.63      54.16         30.21          44.22%
sort            1.76     1.65      52.68         18.96          64.02%
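The TPI Improvement column is the relative TPI reduction, (P3 TPI − our TPI) / P3 TPI. A quick sanity check against two rows of the table:

```python
def tpi_improvement(p3_tpi, ours_tpi):
    """Relative TPI reduction versus the project-3 pipeline."""
    return (p3_tpi - ours_tpi) / p3_tpi

# btest1 and copy rows from the table above
assert round(100 * tpi_improvement(55.81, 28.07), 2) == 49.70
assert round(100 * tpi_improvement(52.62, 11.06), 2) == 78.98
```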

2. CPI comparison of the project 3 design and our design with and without prefetching i-cache and non-blocking d-cache

Test case       Project 3   Our baseline   Prefetch i-cache + non-blocking d-cache
btest1          1.8603      10.6550        2.4410
btest2          1.8549      8.8484         2.3846
copy            1.7538      1.3538         0.9615
copy_long       1.1661      0.9864         0.5763
evens           2.0000      3.3171         2.7927
evens_long      1.2579      1.4340         0.8239
fib             1.5170      1.4898         0.8912
fib_long        1.1400      1.3301         0.9250
fib_rec         1.9893      1.6568         1.7412
insertion       1.6538      1.5017         1.3746
mult            1.3662      2.1169         1.7692
mult_no_lsq     1.1799      2.7158         2.0755
objsort         2.6244      6.5928         6.8603
parallel        1.3402      1.0412         0.7371
parallel_long   1.0725      0.8626         0.5407
partsort        1.3504      2.1284         1.6485
saxpy           1.8054      3.8649         2.6270
sort            1.7561      1.6242         1.6485

3. Comparison of CPI and hit rate for blocking and non-blocking d-cache

Test case       CPI (blocking)   CPI (non-blocking)   Hit rate (blocking)   Hit rate (non-blocking)
btest1          10.6550          10.7445              100.00%               100.00%
btest2          8.8484           8.8852               100.00%               100.00%
copy_long       0.9864           0.9864               100.00%               100.00%
copy            1.3538           1.3566               100.00%               100.00%
evens_long      1.4340           1.4353               100.00%               100.00%
evens           3.3171           3.3457               100.00%               100.00%
fib_long        1.3126           1.3152               100.00%               100.00%
fib_rec         1.7748           1.7049               89.35%                96.07%
fib             1.4898           1.4110               100.00%               100.00%
insertion       1.5017           1.4161               55.80%                80.98%
mult_no_lsq     2.7158           2.7220               100.00%               100.00%
mult            2.1169           2.1204               72.34%                100.00%
objsort         6.2531           5.6997               22.91%                78.58%
parallel_long   0.8626           0.8625               100.00%               100.00%
parallel        1.0412           1.0415               100.00%               100.00%
saxpy           3.8649           2.8641               16.37%                0.00%
sort            1.6160           1.4736               73.26%                93.51%

4. Comparison of CPI between blocking i-cache and prefetch i-cache

Test case       Blocking i-cache   Prefetch i-cache
btest1          10.6550            2.4279
btest2          8.8484             2.3802
copy_long       0.9864             0.5678
copy            1.3538             0.9385
evens_long      1.4340             0.8805
evens           3.3171             2.7805
fib_long        1.3126             1.1021
fib_rec         1.7748             1.6504
fib             1.4898             0.8707
insertion       1.5017             1.5134
mult_no_lsq     2.7158             2.0647
mult            2.1169             1.7938
objsort         6.2531             5.8960
parallel_long   0.8626             0.5835
parallel        1.0412             0.7268
saxpy           3.8649             3.6432
sort            1.6160             1.6175

5. Comparison of branch hit rate with and without 2-bit local branch predictor

Test case       Project 3 baseline   With branch predictor
btest1          25.01%               26.42%
btest2          25.00%               25.22%
copy_long       6.26%                81.25%
copy            6.26%                77.79%
evens_long      28.13%               63.64%
evens           28.13%               28.13%
fib_long        7.15%                7.15%
fib_rec         34.40%               63.02%
fib             7.15%                75.01%
insertion       53.29%               74.41%
mult_no_lsq     6.30%                82.36%
mult            5.89%                82.35%
objsort         1.15%                52.61%
parallel_long   6.26%                81.25%
parallel        6.26%                76.47%
saxpy           5.27%                76.19%
sort            31.14%               63.52%

6. GUI Debugger
