CSEN601 Spring2011 - Practice Assignment 1
PROBLEM 1:
An application running on a 1GHz pipelined processor has the following instruction mix:
Instruction   Frequency   CPI
Load-store    55%         5
Arithmetic    30%         4
Branch        15%         4
a) Determine the overall CPI of the program.
b) An embedded version of the processor that operates at 600 MHz is used to run the same application. In this version, the CPI of branch
instructions becomes 6 while the CPIs of the other instruction types remain unchanged. A new compiler is used which eliminates 25% of the load-store
instructions as well as 5% of the arithmetic instructions for this application.
i. Determine the overall CPI of the program on the embedded processor with the new compiler.
ii. Determine the factor by which the application on the embedded processor runs faster/slower.
Solution:
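a) CPI = 0.55 × 5 + 0.30 × 4 + 0.15 × 4 = 4.55 cycles/instruction
b) First we calculate the new percentages for each type of instruction:
i. Percentage of eliminated load-store from total instructions
= 0.25 × 55% = 13.75%
Percentage of eliminated arithmetic from total instructions
= 0.05 × 30% = 1.5%
Percentage of remaining instructions from total instructions
= 100% - (13.75% + 1.5%) = 84.75%
New percentage of load-store instructions
= (55% - 13.75%) / 84.75% = 48.67%
New percentage of arithmetic instructions
= (30% - 1.5%) / 84.75% = 33.63%
New percentage of branch instructions
= 15% / 84.75% = 17.70%
CPI = 0.4867 × 5 + 0.3363 × 4 + 0.1770 × 6 = 4.84 cycles/instr.
ii. Speedup = Execution time (original) / Execution time (embedded)
= (IC × 4.55 / 1 GHz) / (0.8475 × IC × 4.84 / 600 MHz)
= 4.55 / 6.84 = 0.67
(i.e. the program now runs slower, by a factor of about 1.5)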
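The arithmetic for parts (a) and (b) can be checked with a short script. This is only a sketch built from the figures in the problem statement; the names BASE_MIX, overall_cpi and embedded_mix are illustrative and do not come from the course material.

```python
# Instruction mix from the problem statement: (fraction of instructions, CPI).
BASE_MIX = {
    "load_store": (0.55, 5),
    "arithmetic": (0.30, 4),
    "branch": (0.15, 4),
}


def overall_cpi(mix):
    """Weighted-average CPI; fractions are normalised by their sum."""
    total = sum(frac for frac, _ in mix.values())
    return sum(frac * cpi for frac, cpi in mix.values()) / total


def embedded_mix(mix):
    """Mix on the embedded processor, as fractions of the ORIGINAL instruction count."""
    ls_frac, ls_cpi = mix["load_store"]
    ar_frac, ar_cpi = mix["arithmetic"]
    br_frac, _ = mix["branch"]
    return {
        "load_store": (ls_frac * (1 - 0.25), ls_cpi),  # compiler removes 25%
        "arithmetic": (ar_frac * (1 - 0.05), ar_cpi),  # compiler removes 5%
        "branch": (br_frac, 6),                        # branch CPI becomes 6
    }


cpi_old = overall_cpi(BASE_MIX)
new_mix = embedded_mix(BASE_MIX)
remaining = sum(frac for frac, _ in new_mix.values())  # share of instructions left
cpi_new = overall_cpi(new_mix)

# Execution time per original instruction = instruction share * CPI / clock rate.
time_old = 1.0 * cpi_old / 1e9
time_new = remaining * cpi_new / 600e6
speedup = time_old / time_new  # below 1 means the embedded version is slower

print(f"CPI (original): {cpi_old:.2f}")
print(f"CPI (embedded): {cpi_new:.2f}")
print(f"Speedup:        {speedup:.3f}")
```

A speedup below 1 means the embedded configuration is slower, even though it executes fewer instructions.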
PROBLEM 2:
Suppose a MIPS processor uses the simple 5-stage pipeline described in the text. Further suppose that:
- There is a single memory for both instructions and data, which can do one read or write each cycle.
- No forwarding is used.
- An instruction cannot be fed into the pipeline until the hardware knows for certain that the instruction is to be executed (no earlier than the end of the execution stage if the current instruction is a branch).
- In the absence of hazards, a new instruction can be fed into the pipeline each cycle.
For the following MIPS code:
lw R1, 0(R2)
lw R3, 12(R4)
add R5, R1, R3
beq R5, R5, L1
sw R5, 0(R3)
L1: sw R5, 12(R4)
a) Show, using a diagram, how many cycles this code takes to complete.
b) Show, using a diagram, how different hazard-solving techniques can be used to decrease the total number of cycles for this program.
Solution:
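a) As shown below, the code will take 15 cycles. Because beq R5, R5, L1 compares a register with itself, the branch is always taken, so sw R5, 0(R3) is never executed.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
lw R1,0(R2) IF ID EX M WB
lw R3,12(R4) IF ID EX M WB
add R5,R1,R3 IF ID EX M WB
beq R5,R5,L1 IF ID EX M WB
L1:sw R5,12(R4) IF ID EX M WB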
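One way to check the 15-cycle count is a small timing model. The sketch below is not the course's method, and its rules are assumptions of this example: a register written in WB can be read in ID during the same cycle, the next fetch waits until the previous instruction has entered ID, a fetch cannot share a cycle with a load or store in its M stage, and the instruction after a branch is fetched one cycle after the branch's EX stage. The names PROGRAM and schedule are illustrative.

```python
# Hypothetical timing model; every rule below is an assumption of this sketch.
# Fields: text, destination register, source registers, uses data memory, is branch.
PROGRAM = [
    ("lw  R1, 0(R2)",  "R1", ["R2"],       True,  False),
    ("lw  R3, 12(R4)", "R3", ["R4"],       True,  False),
    ("add R5, R1, R3", "R5", ["R1", "R3"], False, False),
    ("beq R5, R5, L1", None, ["R5"],       False, True),
    # beq R5, R5 is always taken, so "sw R5, 0(R3)" never enters the pipeline.
    ("sw  R5, 12(R4)", None, ["R5", "R4"], True,  False),
]


def schedule(program):
    """Return, per instruction, the cycle in which it occupies each stage."""
    rows = []
    ready = {}        # register -> cycle in which its new value is written back
    mem_busy = set()  # cycles in which a load/store uses the single memory
    for text, dest, sources, uses_memory, is_branch in program:
        fetch = 1
        if rows:
            prev = rows[-1]
            fetch = prev["ID"]                      # IF frees once prev enters ID
            if prev["is_branch"]:
                fetch = max(fetch, prev["EX"] + 1)  # wait for the branch outcome
        while fetch in mem_busy:                    # structural hazard on memory
            fetch += 1
        decode = fetch + 1
        if rows:
            decode = max(decode, rows[-1]["ID"] + 1)  # keep program order
        for reg in sources:                           # no forwarding: wait for WB
            decode = max(decode, ready.get(reg, 0))
        row = {"text": text, "IF": fetch, "ID": decode, "EX": decode + 1,
               "M": decode + 2, "WB": decode + 3, "is_branch": is_branch}
        if uses_memory:
            mem_busy.add(row["M"])
        if dest is not None:
            ready[dest] = row["WB"]
        rows.append(row)
    return rows


rows = schedule(PROGRAM)
total_cycles = rows[-1]["WB"]
for row in rows:
    stages = " ".join(f"{s}={row[s]}" for s in ("IF", "ID", "EX", "M", "WB"))
    print(f"{row['text']:<16} {stages}")
print("total cycles:", total_cycles)
```

Under these assumptions the last instruction writes back in cycle 15, which agrees with the stated answer; a different stall convention could shift individual stages without changing the method.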
b) Using the following hazard-solving techniques:
- Forwarding (to resolve some data hazards)
- Separate instruction and data memories (to resolve some structural hazards)
- Branch prediction
Assuming the branch prediction turns out to be correct, the code will take 11 cycles.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
lw R1,0(R2) IF ID EX M WB
lw R3,12(R4) IF ID EX M WB
add R5,R1,R3 IF ID EX M WB
beq R5,R5,L1 IF ID EX M WB
L1:sw R5,12(R4) IF ID EX M WB
PROBLEM 3
A five-stage pipelined processor supports the following instruction types:
Instruction      Frequency
Load             25%
Store            15%
Integer          30%
Floating point   20%
Branch           10%
Assume the base CPI of the processor is equal to 1. Data hazards for floating point operations cause an average penalty of 0.9 stall cycles, branch
instructions have a misprediction penalty of 1 stall cycle, and all other instructions run at the maximum possible throughput. For branch instructions,
the processor uses the predict-not-taken scheme. If the branch prediction turns out to be correct 80% of the time, calculate the average CPI for this
program.
Solution:
The average CPI = the base CPI + the average number of stalls per instruction
= 1 + (0.20 × 0.9 + 0.10 × 0.20 × 1) = 1.2 cycles/instr.
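The stall accounting can be written out as a short calculation. This is a sketch using only the numbers given in the problem; the constant names are illustrative.

```python
BASE_CPI = 1.0
MIX = {"load": 0.25, "store": 0.15, "integer": 0.30, "fp": 0.20, "branch": 0.10}

FP_STALL = 0.9              # average stall cycles per floating point instruction
MISPREDICT_PENALTY = 1      # stall cycles per mispredicted branch
PREDICTION_ACCURACY = 0.80  # fraction of branches predicted correctly

# Only floating point data hazards and mispredicted branches add stall cycles.
fp_stalls = MIX["fp"] * FP_STALL
branch_stalls = MIX["branch"] * (1 - PREDICTION_ACCURACY) * MISPREDICT_PENALTY
average_cpi = BASE_CPI + fp_stalls + branch_stalls

print(f"average CPI = {average_cpi:.2f}")
```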
PROBLEM 4:
a) Identify all WAR, WAW and RAW dependencies in the following instruction sequence:
LD F2, 16(R6)
ADDD F2, F2, F4
DIVD F6, F2, F0
SUBD F0, F2, F10
SD F6, 32(R3)
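For part (a), the dependencies can be enumerated mechanically. The sketch below is one possible approach, not the official solution: it records, for each register, the most recent writer and the readers since that write, and reports RAW, WAR and WAW pairs by instruction number. The tuple encoding of the instructions is an assumption made for this example.

```python
# (text, destination register or None, source registers)
INSTRUCTIONS = [
    ("LD   F2, 16(R6)",  "F2", ["R6"]),
    ("ADDD F2, F2, F4",  "F2", ["F2", "F4"]),
    ("DIVD F6, F2, F0",  "F6", ["F2", "F0"]),
    ("SUBD F0, F2, F10", "F0", ["F2", "F10"]),
    ("SD   F6, 32(R3)",  None, ["F6", "R3"]),
]


def find_dependencies(instructions):
    """Return (kind, earlier, later, register) tuples, numbering from 1."""
    last_writer = {}  # register -> index of the most recent writer
    readers = {}      # register -> indices that read it since that write
    found = []
    for i, (_, dest, sources) in enumerate(instructions):
        for reg in sources:
            if reg in last_writer:  # read of a value produced earlier
                found.append(("RAW", last_writer[reg] + 1, i + 1, reg))
        for reg in sources:
            readers.setdefault(reg, []).append(i)
        if dest is not None:
            for r in readers.get(dest, []):
                if r != i:          # an earlier read must not be overtaken
                    found.append(("WAR", r + 1, i + 1, dest))
            if dest in last_writer:  # two writes to the same register
                found.append(("WAW", last_writer[dest] + 1, i + 1, dest))
            last_writer[dest] = i
            readers[dest] = []
    return found


dependencies = find_dependencies(INSTRUCTIONS)
for kind, earlier, later, reg in dependencies:
    print(f"{kind}: instruction {earlier} -> instruction {later} on {reg}")
```

Only the nearest producer is reported for each read, so DIVD and SUBD are linked to ADDD rather than to LD.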
b) Fill in the blank templates for executing this code with and without Tomasulo’s Algorithm for this instruction sequence.
Assume the following execution times:
LW: 2 cycles ADD/SUB: 2 cycles BNEZ: 3 cycles MULT/DIV: 4 cycles
For the original FP unit, assume one integer unit, one floating-point multiply unit, one FP add unit, and one FP divide unit.
For Tomasulo’s algorithm, assume:
- Three FP add units, two FP multiply units, six load buffers and three store buffers (the same units as in the book example).
- There is a cache miss causing a stall of 8 cycles on the execution of the first LD.
- FP adds/subs take 2 cycles, multiplies take 10 cycles and divides take 20 cycles.
- The store is a cache hit and executes in one cycle.
- Many instructions can read from the register file simultaneously.
- Only one instruction can drive the CDB at a time.
Solution:
Without Tomasulo’s algorithm, with the processor using forwarding:
LD F2,16(R6) IF ID EX MEM1 MEM2 WB
ADDD F2,F2,F4 IF ID stall stall EX1 EX2 MEM WB
DIVD F6,F2,F0 IF stall stall ID stall EX1 EX2 EX3 EX4 MEM WB
SUBD F0,F2,F10 stall stall IF stall ID stall stall stall EX1 EX2 MEM WB
SD F6,32(R3) stall stall stall IF stall stall stall ID stall EX MEM1 MEM2 WB
Notes:
We assumed one execution unit and one memory unit, and we had to respect in-order execution and in-order completion when
resolving the stalls, exactly as shown in slide 5 of the ILP chapter.
With Tomasulo’s Algorithm:
We will use the same architecture shown in the lecture
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6                                    Load1   No
ADDD   F2     F2    F4                                    Load2   No
DIVD   F6     F2    F0                                    Load3   No
SUBD   F0     F2    F10
SD     F6     32    R3
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
0      FU
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1                       2      Load1   Yes    16(R6)
ADDD   F2     F2    F4                                    Load2   No
DIVD   F6     F2    F0                                    Load3   No
SUBD   F0     F2    F10
SD     F6     32    R3
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
1      FU            Load1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1                       1      Load1   Yes    16(R6)
ADDD   F2     F2    F4     2                              Load2   No
DIVD   F6     F2    F0                                    Load3   No
SUBD   F0     F2    F10
SD     F6     32    R3
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
       Add1    Yes    ADDD            F4     Load1
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
2      FU            Add1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3               0      Load1   Yes    16(R6)
ADDD   F2     F2    F4     2                              Load2   No
DIVD   F6     F2    F0     3                              Load3   No
SUBD   F0     F2    F10
SD     F6     32    R3
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
       Add1    Yes    ADDD            F4     Load1
       Add2    No
       Add3    No
       Mult1   Yes    DIVD            F0     Add1
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
3      FU            Add1            Mult1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2                              Load2   No
DIVD   F6     F2    F0     3                              Load3   No
SUBD   F0     F2    F10    4
SD     F6     32    R3
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
2      Add1    Yes    ADDD   MEM(1)   F4
       Add2    Yes    SUBD            F10    Add1
       Add3    No
       Mult1   Yes    DIVD            F0     Add1
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
4      FU    Add2    Add1            Mult1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2                              Load2   No
DIVD   F6     F2    F0     3                              Load3   No
SUBD   F0     F2    F10    4
SD     F6     32    R3     5
                                                          Store   Yes    32(R3)   Mult1
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
1      Add1    Yes    ADDD   MEM(1)   F4
       Add2    Yes    SUBD            F10    Add1
       Add3    No
       Mult1   Yes    DIVD            F0     Add1
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
5      FU    Add2    Add1            Mult1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2       6                      Load2   No
DIVD   F6     F2    F0     3                              Load3   No
SUBD   F0     F2    F10    4
SD     F6     32    R3     5
                                                          Store   Yes    32(R3)   Mult1
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
0      Add1    Yes    ADDD   MEM(1)   F4
       Add2    Yes    SUBD            F10    Add1
       Add3    No
       Mult1   Yes    DIVD            F0     Add1
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
6      FU    Add2    Add1            Mult1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2       6      7               Load2   No
DIVD   F6     F2    F0     3                              Load3   No
SUBD   F0     F2    F10    4
SD     F6     32    R3     5
                                                          Store   Yes    32(R3)   Mult1
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
0      Add1    No
2      Add2    Yes    SUBD   res1     F10
       Add3    No
4      Mult1   Yes    DIVD   res1     F0
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
7      FU    Add2    res1            Mult1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2       6      7               Load2   No
DIVD   F6     F2    F0     3                              Load3   No
SUBD   F0     F2    F10    4
SD     F6     32    R3     5
                                                          Store   Yes    32(R3)   Mult1
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
0      Add1    No
1      Add2    Yes    SUBD   res1     F10
       Add3    No
3      Mult1   Yes    DIVD   res1     F0
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
8      FU    Add2    res1            Mult1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2       6      7               Load2   No
DIVD   F6     F2    F0     3                              Load3   No
SUBD   F0     F2    F10    4
SD     F6     32    R3     5
                                                          Store   Yes    32(R3)   Mult1
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
0      Add1    No
1      Add2    Yes    SUBD   res1     F10
       Add3    No
3      Mult1   Yes    DIVD   res1     F0
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
9      FU    Add2    res1            Mult1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2       6      7               Load2   No
DIVD   F6     F2    F0     3                              Load3   No
SUBD   F0     F2    F10    4       10
SD     F6     32    R3     5
                                                          Store   Yes    32(R3)   Mult1
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
0      Add1    No
0      Add2    Yes    SUBD   res1     F10
       Add3    No
2      Mult1   Yes    DIVD   res1     F0
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
10     FU    Add2    res1            Mult1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2       6      7               Load2   No
DIVD   F6     F2    F0     3                              Load3   No
SUBD   F0     F2    F10    4       10     11
SD     F6     32    R3     5
                                                          Store   Yes    32(R3)   Mult1
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
0      Add1    No
0      Add2    No
       Add3    No
1      Mult1   Yes    DIVD   res1     F0
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
11     FU    res2    res1            Mult1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2       6      7               Load2   No
DIVD   F6     F2    F0     3       12                     Load3   No
SUBD   F0     F2    F10    4       10     11
SD     F6     32    R3     5
                                                          Store   Yes    32(R3)   Mult1
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
0      Add1    No
0      Add2    No
       Add3    No
0      Mult1   Yes    DIVD   res1     F0
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
12     FU    res2    res1            Mult1
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2       6      7               Load2   No
DIVD   F6     F2    F0     3       12     13              Load3   No
SUBD   F0     F2    F10    4       10     11
SD     F6     32    R3     5                       2
                                                          Store   Yes    32(R3)   res3
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
0      Add1    No
0      Add2    No
       Add3    No
0      Mult1   No
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
13     FU    res2    res1            res3
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2       6      7               Load2   No
DIVD   F6     F2    F0     3       12     13              Load3   No
SUBD   F0     F2    F10    4       10     11
SD     F6     32    R3     5                       1
                                                          Store   Yes    32(R3)   res3
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
0      Add1    No
0      Add2    No
       Add3    No
0      Mult1   No
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
14     FU    res2    res1            res3
Instruction status:                Exec   Write
Instruction   j     k      Issue   Comp   Result   Time           Busy   Address
LD     F2     16    R6     1       3      4        0      Load1   No
ADDD   F2     F2    F4     2       6      7               Load2   No
DIVD   F6     F2    F0     3       12     13              Load3   No
SUBD   F0     F2    F10    4       10     11
SD     F6     32    R3     5       15              0
                                                          Store   Yes    32(R3)   res3
Reservation Stations:        S1       S2     RS      RS
Time   Name    Busy   Op     Vj       Vk     Qj      Qk
0      Add1    No
0      Add2    No
       Add3    No
0      Mult1   No
       Mult2   No
Register result status:
Clock        F0      F2      F4      F6      F8      F10
15     FU    res2    res1            res3
PROBLEM 5:
Consider the following code. (The .... marks indicate instructions that are ignored in this example.)
LOOP1: ADDI R4, R0, #4
.......
LOOP2: SUBI R4, R4, #1
.......
BNEZ R4, LOOP2
.......
BEQZ R8, LOOP1
.......
a) Focusing on the inner loop (LOOP2) only, analyze the branch behavior. Assume no other instruction changes the value of register R4. What
percentage of the time is the BNEZ branch instruction taken and not taken?
Suppose the BNEZ branch is taken N times on each pass through LOOP2. The branch is then taken N times out of every N+1 executions, i.e. it is
taken N/(N+1) of the time and not taken 1/(N+1) of the time. With R4 initialised to 4, N = 3, so the branch is taken 75% of the time and not taken 25% of the time.
b) Choose the best static branch prediction scheme for the BNEZ instruction. What percentage of the time will this static branch prediction be
correct for LOOP2?
Using predict-taken, the prediction is correct N times out of every N+1 branch executions, i.e. N/(N+1) of the time.
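The taken/not-taken counts for one pass through LOOP2 can be reproduced by stepping through the two instructions that touch R4. This is a minimal sketch; bnez_outcomes is an illustrative helper, and it assumes, as the question states, that nothing else modifies R4.

```python
def bnez_outcomes(initial_r4=4):
    """Outcomes of BNEZ R4, LOOP2 for one pass through the inner loop."""
    outcomes = []
    r4 = initial_r4            # ADDI R4, R0, #4
    while True:
        r4 -= 1                # SUBI R4, R4, #1
        taken = r4 != 0        # BNEZ R4, LOOP2
        outcomes.append(taken)
        if not taken:
            return outcomes


outcomes = bnez_outcomes()
taken_fraction = sum(outcomes) / len(outcomes)

# A static predict-taken scheme is right exactly when the branch is taken.
static_taken_accuracy = taken_fraction

print(outcomes)
print(f"taken {taken_fraction:.0%}, not taken {1 - taken_fraction:.0%}")
```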
c) Now consider dynamic branch prediction. Draw the state machine for a one-bit branch predictor. Be sure to clearly identify or define the
meaning of each state. For the inner loop (LOOP2), what will be the misprediction rate of the one-bit branch predictor?
A 1-bit branch predictor has two states, Predict Taken and Predict Not Taken, and after each branch it moves to the state that matches the actual outcome. Studying LOOP2 only:
Iteration        1          2      3      ...   N      N+1
Prediction       Not Taken  Taken  Taken  ...   Taken  Taken
Actual outcome   Taken      Taken  Taken  ...   Taken  Not Taken
So the predictor is wrong 2 times out of every N+1 branch executions, i.e. the misprediction rate is 2/(N+1).
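The 2-out-of-(N+1) figure can be confirmed by simulating a 1-bit predictor over repeated passes through LOOP2. In this sketch N = 3 (R4 starts at 4) and the predictor is assumed to start in the Not Taken state; the function name is illustrative.

```python
def one_bit_mispredictions(outcomes, initial_prediction=False):
    """Count mispredictions of a 1-bit predictor (True means taken)."""
    prediction = initial_prediction
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # remember only the most recent outcome
    return misses


N = 3                            # BNEZ is taken N times, then falls through
one_pass = [True] * N + [False]
passes = 10
misses = one_bit_mispredictions(one_pass * passes)
rate = misses / (passes * len(one_pass))

print(f"{misses} mispredictions in {passes * len(one_pass)} branches ({rate:.0%})")
```

Each pass costs two mispredictions: one on entry, because the previous pass ended not taken, and one on exit.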
d) Now draw the state diagram for a 2-bit dynamic branch predictor. Again, clearly label all states. What will be the misprediction rate of the
2-bit branch predictor for LOOP2?
Iteration        1          2          3      ...   N      N+1
Prediction       Not Taken  Not Taken  Taken  ...   Taken  Taken
Actual outcome   Taken      Taken      Taken  ...   Taken  Not Taken
[Figure: state diagram of the 2-bit predictor, with transitions labelled Taken and Not Taken]
e) Taking both loops into consideration, draw the state diagram for a (2,2) correlating dynamic branch predictor.
We will not use the (2,2) predictor, as it is not described in the lecture, so we only consider the relation between LOOP1 and LOOP2. If LOOP2 is executed N times in every LOOP1 iteration, the first LOOP1 iteration incurs 3 mispredictions, and every later LOOP1 iteration incurs only 1 misprediction until the end of LOOP1.
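The "3 misses, then 1 miss" pattern can be reproduced with a 2-bit saturating counter. The sketch below assumes the counter form of the 2-bit predictor (states 0 and 1 predict not taken, states 2 and 3 predict taken), a start in state 0, and N = 3; the lecture's state diagram may be drawn differently, and the function name is illustrative.

```python
def two_bit_misses_per_pass(passes, n_taken=3, state=0):
    """Mispredictions of BNEZ in each pass through LOOP2.

    The counter state carries over from one LOOP1 iteration to the next.
    """
    result = []
    for _ in range(passes):
        misses = 0
        for taken in [True] * n_taken + [False]:
            predicted_taken = state >= 2
            if predicted_taken != taken:
                misses += 1
            # Saturating update: count up on taken, down on not taken.
            state = min(state + 1, 3) if taken else max(state - 1, 0)
        result.append(misses)
    return result


misses_per_pass = two_bit_misses_per_pass(4)
print(misses_per_pass)
```

After the first pass the counter stays in the taken half of the state space, so only the loop-exit branch is mispredicted in later passes.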