11/1/2005 Comp 120 Fall 2005 1
1 November
• Exam Postmortem
• 11 classes to go!
• Read Sections 7.1 and 7.2
• You read 6.1 for this time. Right?
• Pipelining, then on to Memory hierarchy
Question 1
[6] Give at least 3 different formulas for execution time using the following variables. Each equation should be minimal; that is, it should not contain any variable that is not needed. CPI, R=clock rate, T=cycle time, M=MIPS, I=number of instructions in program, C=number of cycles in program.
CPI*I*T, CPI*I/R, C*T, C/R, I/(M*10^6)
Question 2
[12] Soon we will all have computers with multiple processors. This will allow us to run some kinds of programs faster. Suppose your computer has N processors and that a program you want to make faster has some fraction F of its computation that is inherently sequential (either all the processors have to do the same work thus duplicating effort, or the work has to be done by a single processor while the other processors are idle). What speedup can you expect to achieve on your new computer?
1/(F + (1-F)/N)
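As a sanity check, here is a minimal Python sketch of this speedup formula (Amdahl's law); the sample values of F and N are illustrative, not from the exam.

```python
# Amdahl's law: speedup from N processors when a fraction F of the
# work is inherently sequential.
def speedup(F, N):
    return 1.0 / (F + (1.0 - F) / N)

# With no sequential part the speedup is the full N.
print(speedup(0.0, 8))   # 8.0

# A 10% sequential fraction caps 8 processors well below 8x.
print(speedup(0.1, 8))   # about 4.7
```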
Question 3
[10] Show the minimum sequence of instructions required to sum up an array of 5 integers, stored as consecutive words in memory, given the address of the first word in $a0 and leaving the sum in register $v0.
5 loads + 4 adds = 9 instructions
Question 4
[10] Which of the following cannot be EXACTLY represented by an IEEE double-precision floating-point number? (a) 0, (b) 10.2, (c) 1.625, (d) 11.5
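One way to check this (the use of Python's fractions module here is my addition, not the exam's): Fraction(x) recovers the exact value a float stores, so comparing it with the intended decimal shows which constants survive conversion unchanged.

```python
from fractions import Fraction

# 0, 1.625 (= 13/8) and 11.5 (= 23/2) are sums of powers of two,
# so they convert to binary floating point exactly.
for s in ["0", "1.625", "11.5"]:
    assert Fraction(float(s)) == Fraction(s)

# 10.2 = 51/5 has a factor of 5 in its denominator, so it cannot be
# a finite binary fraction and is rounded when stored.
assert Fraction(10.2) != Fraction("10.2")
```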
Question 5
• [12] All of the following equations are true for the idealized numbers you studied in algebra. Which ones are true for IEEE floating point numbers? Assume that all of the numbers are well within the range of the largest and smallest possible numbers (that is, underflow and overflow are not a problem).
– A+B = A if and only if B = 0
– (A+B)+C = A+(B+C)
– A+B = B+A
– A*(B+C) = A*B + A*C
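A quick empirical check in Python (the particular constants 0.1, 0.2, 0.3 and 1e16 are my choices): commutativity always holds for IEEE doubles, but the first two identities can fail.

```python
# Commutativity of + is guaranteed by IEEE 754.
a, b, c = 0.1, 0.2, 0.3
assert a + b == b + a

# Associativity can fail: each + rounds, and the roundings differ.
assert (a + b) + c != a + (b + c)

# "A+B = A iff B = 0" is false: a small addend can be absorbed
# entirely when the exponents are far apart.
big, tiny = 1e16, 1.0
assert big + tiny == big and tiny != 0
```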
Question 6
[10] Consider the characteristics of two machines M1 and M2. M1 has a clock rate of 1GHz. M2 has a clock rate of 2GHz. There are 4 classes of instructions (A-D) in the instruction set. In a set of benchmark programs, the frequency of each class of instructions is shown in the table.
Instruction Class   Frequency   M1 CPI   M2 CPI
A                   40%         2        6
B                   25%         3        6
C                   20%         3        6
D                   15%         5        8
What is the average CPI for each machine? CPI1 = 2*.4 + 3*.25 + 3*.2 + 5*.15 = 2.9; CPI2 = 6*.4 + 6*.25 + 6*.2 + 8*.15 = 6.3.
Which machine is faster? M1.
By what factor is it faster? (1000*6.3)/(2000*2.9) = 1.086.
What is the cycle time of each machine? M1 = 1 ns, M2 = 0.5 ns.
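The same arithmetic as a small Python sketch (table values from the slide):

```python
# Weighted-average CPI from the benchmark mix, then relative speed.
freq   = [0.40, 0.25, 0.20, 0.15]   # classes A-D
cpi_m1 = [2, 3, 3, 5]               # M1 at 1 GHz
cpi_m2 = [6, 6, 6, 8]               # M2 at 2 GHz

avg1 = sum(f * c for f, c in zip(freq, cpi_m1))   # 2.9
avg2 = sum(f * c for f, c in zip(freq, cpi_m2))   # 6.3

# Time per instruction = CPI / clock rate; M1's speedup over M2:
speedup = (avg2 / 2e9) / (avg1 / 1e9)             # about 1.086
```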
Question 7
[8] Draw a block diagram showing how to implement 2’s complement subtraction of 3 bit numbers (A minus B) using 1-bit full-adder blocks with inputs A, B, and Cin and outputs Sum and Cout and any other AND, OR, or INVERT blocks you may need.
Question 8
[10] Assume that multiply instructions take 12 cycles and account for 10% of the instructions in a typical program and that the other 90% of the instructions require an average of 4 cycles for each instruction. What percentage of time does the CPU spend doing multiplication?
12*0.1/(12*0.1+4*0.9)=25%
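Spelled out in Python (a sketch of the slide's arithmetic):

```python
# Time share of multiplies = their cycle weight over the total
# cycle weight per instruction.
mul    = 12 * 0.10   # 12-cycle multiplies, 10% of instructions
others = 4 * 0.90    # 4-cycle average for the remaining 90%

fraction = mul / (mul + others)   # 1.2 / 4.8 = 0.25
```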
Question 9
[10] In a certain set of benchmark programs, about every 4th instruction is a load instruction that fetches data from main memory. The time required for a load is 50ns. The CPI for all other instructions is 4. Assuming the ISAs are the same, how much faster will the benchmarks run with a 1GHz clock than with a 500MHz clock?
For 1GHz: 50 + 12 = 62 cycles and 62 nanoseconds
For 500MHz: 25 + 12 = 37 cycles and 74 nanoseconds
Speedup is 74 / 62 = 1.19
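The same calculation as a small Python sketch (the per-4-instruction grouping follows the slide):

```python
# Per group of 4 instructions: one load fixed at 50 ns, plus three
# instructions at CPI 4.
def group_time_ns(clock_hz):
    cycle_ns = 1e9 / clock_hz
    load_cycles = 50 / cycle_ns        # the 50 ns load, in cycles
    return (load_cycles + 3 * 4) * cycle_ns

t_fast = group_time_ns(1e9)      # (50 + 12) cycles * 1 ns = 62 ns
t_slow = group_time_ns(500e6)    # (25 + 12) cycles * 2 ns = 74 ns
speedup = t_slow / t_fast        # about 1.19
```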
Question 10
[12] Explain why floating point addition and subtraction may result in loss of precision but multiplication and division do not.
Before we can add we have to adjust the binary point so that the exponents are equal. This can cause many bits to be lost if the exponents are very different.
Multiplication doesn’t have this problem. All the bits play equally.
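A concrete Python illustration (the constants are my choice): at magnitude 2^53 the gap between adjacent doubles is 2, so adding 1.0 loses it entirely, while multiply and divide keep all the significant bits.

```python
big   = 2.0 ** 53   # spacing between adjacent doubles here is 2
small = 1.0

# Addition aligns exponents first; small's bit is shifted out and lost.
assert big + small == big

# Multiplication has no alignment step: both operands' bits contribute.
assert (big * small) / big == small
```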
Results
[Bar chart: distribution of exam scores in bins <60, 61-70, 71-80, 81-90, 91-100; counts per bin range from 0 to 7.]
Average: 72.5
Median: 81
Chapter 6 Pipelining
Doing Laundry
[Figure: four loads of laundry (A-D) between 6 PM and 2 AM, shown first done one after another and then pipelined with their stages overlapped; time runs on the horizontal axis, task order on the vertical axis.]
Pipelining
• Improve performance by increasing instruction throughput
Ideal speedup is the number of stages in the pipeline. Do we achieve this?
[Figure: program execution order for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Non-pipelined, each instruction takes 8 ns through Instruction fetch, Reg, ALU, Data access, Reg before the next begins. Pipelined, a new instruction starts every 2 ns and the stages overlap.]
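A small Python sketch of the throughput math in this example (8 ns per unpipelined instruction, five 2 ns stages, as in the figure):

```python
# Total time to run n instructions, unpipelined vs. pipelined, using
# the lecture's timings: 8 ns per instruction unpipelined, five
# stages of 2 ns each pipelined.
def unpipelined_ns(n):
    return 8 * n

def pipelined_ns(n, stages=5, stage_ns=2):
    # First instruction fills the pipe; then one completes per stage time.
    return (stages + n - 1) * stage_ns

# Three loads: 24 ns vs. 14 ns.  As n grows the speedup approaches
# 8/2 = 4, not the 5-stage ideal, because the stage times are not
# perfectly balanced.
print(unpipelined_ns(3), pipelined_ns(3))
```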
Pipeline control
• We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Register Write Back
• How would control be handled in an automobile plant?
– a fancy control center telling everyone what to do?
– should we use a finite state machine?
Pipelining
• What makes it easy?
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
• What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
• Individual instructions still take the same number of cycles
• But we’ve improved the throughput by increasing the number of simultaneously executing instructions
Structural Hazards
[Figure: four instructions overlapped in the pipeline, each passing through Inst Fetch, Reg Read, ALU, Data Access, and Reg Write; in some cycles one instruction is fetching while another is accessing data, which is a structural hazard if there is only one memory.]
Data Hazards
• Problem with starting the next instruction before the first is finished
– dependencies that “go backward in time” are data hazards
[Figure: pipeline diagram over clock cycles CC 1-CC 9 for
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
The value of register $2 is 10 through CC 4 and -20 from CC 5 on, so instructions that read $2 before the sub writes it back see the old value.]
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the “nops”?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
• Problem: this really slows us down!
Forwarding
• Use temporary results, don’t wait for them to be written
– register file forwarding to handle read/write to same register
– ALU forwarding
[Figure: the same sub/and/or/add/sw sequence over CC 1-CC 9; the EX/MEM pipeline register holds -20 in CC 4 and MEM/WB holds it in CC 5, and the value is forwarded from those registers to the instructions that need $2.]
Can't always forward
• Load word can still cause a hazard:
– an instruction tries to read a register following a load instruction that writes to the same register.
• Thus, we need a hazard detection unit to “stall” the instruction
[Figure: pipeline diagram over CC 1-CC 9 for
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
The and needs $2 before the lw’s data access has completed, so forwarding alone cannot fix it.]
Stalling
• We can stall the pipeline by keeping an instruction in the same stage
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: the same sequence over CC 1-CC 10, with a bubble inserted so the and is held in its stage for one cycle after the lw; every instruction behind it finishes one cycle later.]
Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
• We are predicting “branch not taken”
– need to add hardware for flushing instructions if we are wrong
[Figure: pipeline diagram over CC 1-CC 9 for
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
If the branch is taken, the and, or, and add already in the pipeline must be flushed before the lw at the branch target (address 72) executes.]
Improving Performance
• Try to avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
• Add a “branch delay slot”
– the next instruction after a branch is always executed
– rely on compiler to “fill” the slot with something useful
• Superscalar: start more than one instruction in the same cycle
Dynamic Scheduling
• The hardware performs the “scheduling”
– hardware tries to find instructions to execute
– out of order execution is possible
– speculative execution and dynamic branch prediction
• All modern processors are very complicated
– Pentium 4: 20 stage pipeline, 6 simultaneous instructions
– PowerPC and Pentium: branch history table
– Compiler technology important
Chapter 7 Preview
Memory Hierarchy
Memory Hierarchy
• Memory devices come in several different flavors
– SRAM – Static RAM
• fast (1 to 10 ns)
• expensive (>10 times DRAM)
• small capacity (< ¼ DRAM)
– DRAM – Dynamic RAM
• 16 times slower than SRAM (50 ns – 100 ns)
• access time varies with address
• affordable ($160 / gigabyte)
• 1 Gig considered big
– DISK
• slow! (10 ms access time)
• cheap! (< $1 / gigabyte)
• big! (1 TByte is no problem)
Memory Hierarchy
• Users want large and fast memories!
• Try to give it to them
– build a memory hierarchy
[Figure: pyramid with the CPU on top of Level 1, Level 2, …, Level n; access time increases and memory size grows with distance from the CPU.]
Locality
• A principle that makes having a memory hierarchy a good idea
• If an item is referenced:
– temporal locality: it will tend to be referenced again soon
– spatial locality: nearby items will tend to be referenced soon
• Why does code have locality?
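To make the idea concrete, here is a toy direct-mapped cache simulation in Python (entirely my construction, not from the slides): sequential word accesses hit 3 times out of 4 with 4-word blocks, while a stride of one whole block never hits.

```python
# Toy direct-mapped cache: count hits for a stream of word addresses.
# Illustrates spatial locality; not a model of any real cache.
def hits(addresses, num_lines=8, block_words=4):
    tags = [None] * num_lines
    count = 0
    for a in addresses:
        block = a // block_words     # which block the word lives in
        line = block % num_lines     # direct-mapped placement
        if tags[line] == block:
            count += 1               # hit: block already resident
        else:
            tags[line] = block       # miss: fetch the block
    return count

print(hits(range(64)))           # sequential: 48 hits (3 of every 4)
print(hits(range(0, 256, 4)))    # stride 4: a new block every time, 0 hits
```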