11/1/2005 Comp 120 Fall 2005 1
1 November
• Exam Postmortem
• 11 classes to go!
• Read Sections 7.1 and 7.2
• You read 6.1 for this time. Right?
• Pipelining, then on to Memory hierarchy
Question 1
[6] Give at least 3 different formulas for execution time using the following variables. Each equation should be minimal; that is, it should not contain any variable that is not needed. CPI, R=clock rate, T=cycle time, M=MIPS, I=number of instructions in program, C=number of cycles in program.
CPI*I*T, CPI*I/R, C*T, C/R, I/(M*10^6)
Question 2
[12] Soon we will all have computers with multiple processors. This will allow us to run some kinds of programs faster. Suppose your computer has N processors and that a program you want to make faster has some fraction F of its computation that is inherently sequential (either all the processors have to do the same work thus duplicating effort, or the work has to be done by a single processor while the other processors are idle). What speedup can you expect to achieve on your new computer?
1/(F + (1-F)/N)
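As a sanity check, here is a minimal Python sketch of this speedup formula (Amdahl's law); the sample values of F and N are illustrative, not from the exam.

```python
# Amdahl's law: speedup from N processors when a fraction F of the
# work is inherently sequential.
def speedup(F, N):
    return 1.0 / (F + (1.0 - F) / N)

# With no sequential part the speedup is the full N.
print(speedup(0.0, 8))   # 8.0

# A 10% sequential fraction caps 8 processors well below 8x.
print(speedup(0.1, 8))   # about 4.7
```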
Question 3
[10] Show the minimum sequence of instructions required to sum up an array of 5 integers, stored as consecutive words in memory, given the address of the first word in $a0 and leaving the sum in register $v0.
5 loads + 4 adds = 9 instructions
Question 4
[10] Which of the following cannot be EXACTLY represented by an IEEE double-precision floating-point number? (a) 0, (b) 10.2, (c) 1.625, (d) 11.5
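One way to check this (the use of Python's fractions module here is my addition, not the exam's): Fraction(x) recovers the exact value a float stores, so comparing it with the intended decimal shows which constants survive conversion unchanged.

```python
from fractions import Fraction

# 0, 1.625 (= 13/8) and 11.5 (= 23/2) are sums of powers of two,
# so they convert to binary floating point exactly.
for s in ["0", "1.625", "11.5"]:
    assert Fraction(float(s)) == Fraction(s)

# 10.2 = 51/5 has a factor of 5 in its denominator, so it cannot be
# a finite binary fraction and is rounded when stored.
assert Fraction(10.2) != Fraction("10.2")
```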
Question 5
• [12] All of the following equations are true for the idealized numbers you studied in algebra. Which ones are true for IEEE floating point numbers? Assume that all of the numbers are well within the range of the largest and smallest possible numbers (that is, underflow and overflow are not a problem).
– A+B = A if and only if B = 0
– (A+B)+C = A+(B+C)
– A+B = B+A
– A*(B+C) = A*B + A*C
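A quick empirical check in Python (the particular constants 0.1, 0.2, 0.3 and 1e16 are my choices): commutativity always holds for IEEE doubles, but the first two identities can fail.

```python
# Commutativity of + is guaranteed by IEEE 754.
a, b, c = 0.1, 0.2, 0.3
assert a + b == b + a

# Associativity can fail: each + rounds, and the roundings differ.
assert (a + b) + c != a + (b + c)

# "A+B = A iff B = 0" is false: a small addend can be absorbed
# entirely when the exponents are far apart.
big, tiny = 1e16, 1.0
assert big + tiny == big and tiny != 0
```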
Question 6
[10] Consider the characteristics of two machines M1 and M2. M1 has a clock rate of 1GHz. M2 has a clock rate of 2GHz. There are 4 classes of instructions (A-D) in the instruction set. In a set of benchmark programs, the frequency of each class of instructions is shown in the table.
Instruction Class   Frequency   M1 CPI   M2 CPI
A                   40%         2        6
B                   25%         3        6
C                   20%         3        6
D                   15%         5        8
What is the average CPI for each machine? CPI1 = 2*.4 + 3*.25 + 3*.2 + 5*.15 = 2.9; CPI2 = 6*.4 + 6*.25 + 6*.2 + 8*.15 = 6.3.
Which machine is faster? M1.
By what factor is it faster? (1000*6.3)/(2000*2.9) = 1.086.
What is the cycle time of each machine? M1 = 1 ns, M2 = 0.5 ns.
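The same arithmetic as a small Python sketch (table values from the slide):

```python
# Weighted-average CPI from the benchmark mix, then relative speed.
freq   = [0.40, 0.25, 0.20, 0.15]   # classes A-D
cpi_m1 = [2, 3, 3, 5]               # M1 at 1 GHz
cpi_m2 = [6, 6, 6, 8]               # M2 at 2 GHz

avg1 = sum(f * c for f, c in zip(freq, cpi_m1))   # 2.9
avg2 = sum(f * c for f, c in zip(freq, cpi_m2))   # 6.3

# Time per instruction = CPI / clock rate; M1's speedup over M2:
speedup = (avg2 / 2e9) / (avg1 / 1e9)             # about 1.086
```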
Question 7
[8] Draw a block diagram showing how to implement 2’s complement subtraction of 3 bit numbers (A minus B) using 1-bit full-adder blocks with inputs A, B, and Cin and outputs Sum and Cout and any other AND, OR, or INVERT blocks you may need.
Question 8
[10] Assume that multiply instructions take 12 cycles and account for 10% of the instructions in a typical program and that the other 90% of the instructions require an average of 4 cycles for each instruction. What percentage of time does the CPU spend doing multiplication?
12*0.1/(12*0.1+4*0.9)=25%
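Spelled out in Python (a sketch of the slide's arithmetic):

```python
# Time share of multiplies = their cycle weight over the total
# cycle weight per instruction.
mul    = 12 * 0.10   # 12-cycle multiplies, 10% of instructions
others = 4 * 0.90    # 4-cycle average for the remaining 90%

fraction = mul / (mul + others)   # 1.2 / 4.8 = 0.25
```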
Question 9
[10] In a certain set of benchmark programs, about every 4th instruction is a load instruction that fetches data from main memory. The time required for a load is 50ns. The CPI for all other instructions is 4. Assuming the ISAs are the same, how much faster will the benchmarks run with a 1GHz clock than with a 500MHz clock?
For 1GHz: 50 + 12 = 62 cycles and 62 nanoseconds
For 500MHz: 25 + 12 = 37 cycles and 74 nanoseconds
Speedup is 74 / 62 = 1.19
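The same calculation as a small Python sketch (the per-4-instruction grouping follows the slide):

```python
# Per group of 4 instructions: one load fixed at 50 ns, plus three
# instructions at CPI 4.
def group_time_ns(clock_hz):
    cycle_ns = 1e9 / clock_hz
    load_cycles = 50 / cycle_ns        # the 50 ns load, in cycles
    return (load_cycles + 3 * 4) * cycle_ns

t_fast = group_time_ns(1e9)      # (50 + 12) cycles * 1 ns = 62 ns
t_slow = group_time_ns(500e6)    # (25 + 12) cycles * 2 ns = 74 ns
speedup = t_slow / t_fast        # about 1.19
```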
Question 10
[12] Explain why floating point addition and subtraction may result in loss of precision but multiplication and division do not.
Before we can add we have to adjust the binary point so that the exponents are equal. This can cause many bits to be lost if the exponents are very different.
Multiplication doesn’t have this problem. All the bits play equally.
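A concrete Python illustration (the constants are my choice): at magnitude 2^53 the gap between adjacent doubles is 2, so adding 1.0 loses it entirely, while multiply and divide keep all the significant bits.

```python
big   = 2.0 ** 53   # spacing between adjacent doubles here is 2
small = 1.0

# Addition aligns exponents first; small's bit is shifted out and lost.
assert big + small == big

# Multiplication has no alignment step: both operands' bits contribute.
assert (big * small) / big == small
```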
Results
[Bar chart: distribution of exam scores in bins <60, 61-70, 71-80, 81-90, 91-100; counts per bin range from 0 to 7.]
Average: 72.5
Median: 81
Chapter 6 Pipelining
Doing Laundry
[Figure: four loads of laundry (A-D) between 6 PM and 2 AM, shown first done one after another and then pipelined with their stages overlapped; time runs on the horizontal axis, task order on the vertical axis.]
Pipelining
• Improve performance by increasing instruction throughput
Ideal speedup is the number of stages in the pipeline. Do we achieve this?
[Figure: program execution order for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Non-pipelined, each instruction takes 8 ns through Instruction fetch, Reg, ALU, Data access, Reg before the next begins. Pipelined, a new instruction starts every 2 ns and the stages overlap.]
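A small Python sketch of the throughput math in this example (8 ns per unpipelined instruction, five 2 ns stages, as in the figure):

```python
# Total time to run n instructions, unpipelined vs. pipelined, using
# the lecture's timings: 8 ns per instruction unpipelined, five
# stages of 2 ns each pipelined.
def unpipelined_ns(n):
    return 8 * n

def pipelined_ns(n, stages=5, stage_ns=2):
    # First instruction fills the pipe; then one completes per stage time.
    return (stages + n - 1) * stage_ns

# Three loads: 24 ns vs. 14 ns.  As n grows the speedup approaches
# 8/2 = 4, not the 5-stage ideal, because the stage times are not
# perfectly balanced.
print(unpipelined_ns(3), pipelined_ns(3))
```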
Pipeline control
• We have 5 stages. What needs to be controlled in each stage?
– Instruction Fetch and PC Increment
– Instruction Decode / Register Fetch
– Execution
– Memory Stage
– Register Write Back
• How would control be handled in an automobile plant?
– a fancy control center telling everyone what to do?
– should we use a finite state machine?
Pipelining
• What makes it easy?
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
• What makes it hard?
– structural hazards: suppose we had only one memory
– control hazards: need to worry about branch instructions
– data hazards: an instruction depends on a previous instruction
• Individual instructions still take the same number of cycles
• But we’ve improved the throughput by increasing the number of simultaneously executing instructions
Structural Hazards
[Figure: four instructions overlapped in the pipeline, each passing through Inst Fetch, Reg Read, ALU, Data Access, and Reg Write; in some cycles one instruction is fetching while another is accessing data, which is a structural hazard if there is only one memory.]
Data Hazards
• Problem with starting the next instruction before the first is finished
– dependencies that “go backward in time” are data hazards
[Figure: pipeline diagram over clock cycles CC 1-CC 9 for
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
The value of register $2 is 10 through CC 4 and -20 from CC 5 on, so instructions that read $2 before the sub writes it back see the old value.]
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the “nops”?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
• Problem: this really slows us down!
Forwarding
• Use temporary results, don’t wait for them to be written
– register file forwarding to handle read/write to same register
– ALU forwarding
[Figure: the same sub/and/or/add/sw sequence over CC 1-CC 9; the EX/MEM pipeline register holds -20 in CC 4 and MEM/WB holds it in CC 5, and the value is forwarded from those registers to the instructions that need $2.]
Can't always forward
• Load word can still cause a hazard:
– an instruction tries to read a register following a load instruction that writes to the same register.
• Thus, we need a hazard detection unit to “stall” the instruction
[Figure: pipeline diagram over CC 1-CC 9 for
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
The and needs $2 before the lw’s data access has completed, so forwarding alone cannot fix it.]
Stalling
• We can stall the pipeline by keeping an instruction in the same stage
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
[Figure: the same sequence over CC 1-CC 10, with a bubble inserted so the and is held in its stage for one cycle after the lw; every instruction behind it finishes one cycle later.]
Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
• We are predicting “branch not taken”
– need to add hardware for flushing instructions if we are wrong
[Figure: pipeline diagram over CC 1-CC 9 for
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
If the branch is taken, the and, or, and add already in the pipeline must be flushed before the lw at the branch target (address 72) executes.]
Improving Performance
• Try to avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
• Add a “branch delay slot”
– the next instruction after a branch is always executed
– rely on compiler to “fill” the slot with something useful
• Superscalar: start more than one instruction in the same cycle
Dynamic Scheduling
• The hardware performs the “scheduling”
– hardware tries to find instructions to execute
– out of order execution is possible
– speculative execution and dynamic branch prediction
• All modern processors are very complicated
– Pentium 4: 20 stage pipeline, 6 simultaneous instructions
– PowerPC and Pentium: branch history table
– Compiler technology important
Chapter 7 Preview
Memory Hierarchy
Memory Hierarchy
• Memory devices come in several different flavors
– SRAM – Static RAM
• fast (1 to 10 ns)
• expensive (>10 times DRAM)
• small capacity (< ¼ DRAM)
– DRAM – Dynamic RAM
• 16 times slower than SRAM (50 ns – 100 ns)
• access time varies with address
• affordable ($160 / gigabyte)
• 1 Gig considered big
– DISK
• slow! (10 ms access time)
• cheap! (< $1 / gigabyte)
• big! (1 TByte is no problem)
Memory Hierarchy
• Users want large and fast memories!
• Try to give it to them
– build a memory hierarchy
[Figure: pyramid with the CPU on top of Level 1, Level 2, …, Level n; access time increases and memory size grows with distance from the CPU.]
Locality
• A principle that makes having a memory hierarchy a good idea
• If an item is referenced:
– temporal locality: it will tend to be referenced again soon
– spatial locality: nearby items will tend to be referenced soon
• Why does code have locality?
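To make the idea concrete, here is a toy direct-mapped cache simulation in Python (entirely my construction, not from the slides): sequential word accesses hit 3 times out of 4 with 4-word blocks, while a stride of one whole block never hits.

```python
# Toy direct-mapped cache: count hits for a stream of word addresses.
# Illustrates spatial locality; not a model of any real cache.
def hits(addresses, num_lines=8, block_words=4):
    tags = [None] * num_lines
    count = 0
    for a in addresses:
        block = a // block_words     # which block the word lives in
        line = block % num_lines     # direct-mapped placement
        if tags[line] == block:
            count += 1               # hit: block already resident
        else:
            tags[line] = block       # miss: fetch the block
    return count

print(hits(range(64)))           # sequential: 48 hits (3 of every 4)
print(hits(range(0, 256, 4)))    # stride 4: a new block every time, 0 hits
```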