
Chapter 4 Part2 - York University

Date post: 19-Apr-2020

CSE 2021 Computer Organization

Chapter 4 Part 2

The Processor - Pipelining


Outline
- CPU overview
- Single cycle MIPS implementation
  - Simple subset
    - Memory reference: lw, sw
    - Arithmetic/logical: add, sub, and, or, slt
    - Control transfer: beq, j
- Pipelined MIPS implementation

Chapter 4 — The Processor — 2


Single Cycle Implementation


Why not single-cycle implementation?
- Assuming no delay at adder, sign extension, shift left unit, PC, control unit and mux
- lw requires 5 functional units: instruction fetch, register access, ALU, data memory access, register access
- sw requires 4 functional units: instruction fetch, register access, ALU, data memory access
- R-type requires 4 functional units: instruction fetch, register access, ALU, register access
- Branch requires 3 functional units: instruction fetch, register access, ALU
- Jump requires 1 functional unit: instruction fetch


Performance Issues
- Longest delay determines clock period
  - Critical path: load instruction (lw)
  - Involving 5 functional units
- Using a clock cycle of equal duration for each instruction is a waste of resources
- Not feasible to vary period for different instructions
- We will improve performance by pipelining


Pipelining Analogy
- Pipelined laundry: overlapping execution
  - Parallelism improves performance
- 4 loads:
  - Speedup = 8/3.5 = 2.3
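The 8/3.5 figure can be reproduced with a short calculation. This is a minimal sketch that assumes each load passes through four stages of 0.5 hours each (an assumption consistent with the 8 and 3.5 hour totals above; the function name and parameters are illustrative only).

```python
def laundry_speedup(loads, stages=4, stage_hours=0.5):
    """Speedup of pipelined over sequential laundry (assumed 4 stages of 0.5 h)."""
    # Sequential: every load runs through all stages before the next one starts.
    sequential = loads * stages * stage_hours
    # Pipelined: the first load takes all stages, each later load
    # finishes one stage-time after the previous one.
    pipelined = (stages + loads - 1) * stage_hours
    return sequential / pipelined


print(round(laundry_speedup(4), 1))  # 2.3
```

Calling the function with a larger number of loads shows the speedup approaching the number of stages.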


Activity 1
- Calculate the speedup factor if 1000 washing jobs are pipelined.


MIPS Pipeline
- Five stages, one step per stage
  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register


Pipeline Performance
- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Time for different instruction types in the single-cycle datapath

Instr     | Instr fetch | Register read | ALU op | Memory access | Register write | Total time
lw        | 200ps       | 100ps         | 200ps  | 200ps         | 100ps          | 800ps
sw        | 200ps       | 100ps         | 200ps  | 200ps         |                | 700ps
R-format  | 200ps       | 100ps         | 200ps  |               | 100ps          | 600ps
beq       | 200ps       | 100ps         | 200ps  |               |                | 500ps
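The totals in the table can be recomputed from the stage times. The sketch below maps the table columns to the stage names IF, ID, EX, MEM and WB (a labeling assumption made here for brevity) and derives the clock period each design would need.

```python
# Stage latencies in picoseconds, taken from the slide.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class uses in the single-cycle datapath.
USES = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def total_time_ps(instr):
    """Sum the latency of every stage the instruction passes through."""
    return sum(STAGE_PS[stage] for stage in USES[instr])


totals = {instr: total_time_ps(instr) for instr in USES}
single_cycle_period = max(totals.values())  # slowest instruction sets the clock
pipelined_period = max(STAGE_PS.values())   # slowest stage sets the clock

print(totals)
print(single_cycle_period, pipelined_period)  # 800 200
```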


Pipeline Performance
- Single-cycle (Tc = 800ps)
- Pipelined (Tc = 200ps)


Activity 2
Calculate the speedup factor for running 2000 pipelined load instructions.
- Single-cycle (Tc = 800ps)
- Pipelined (Tc = 200ps)
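One way to approach this kind of calculation is sketched below. It assumes an ideal five-stage pipeline with no stalls between the loads (independent loads), and uses the two clock periods given above; the function names are illustrative.

```python
def single_cycle_time_ps(n, tc=800):
    """Every instruction takes one full 800 ps cycle."""
    return n * tc


def pipelined_time_ps(n, stages=5, tc=200):
    """First instruction needs `stages` cycles; each later one completes a cycle after."""
    return (stages + n - 1) * tc


def speedup(n):
    return single_cycle_time_ps(n) / pipelined_time_ps(n)


print(single_cycle_time_ps(3), pipelined_time_ps(3))  # 2400 1400
print(round(speedup(2000), 3))                        # 3.992
```

For long instruction sequences the speedup approaches 800/200 = 4, the ratio of the two clock periods.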


Pipeline Speedup
- If all stages are balanced
  - i.e., all take the same time

$$\text{Time between instructions}_{\text{pipelined}} = \frac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$$

- If not balanced, speedup is less
  - Pipelining added some overhead (additional 100ps for Register Read)
- Speedup due to increased throughput
  - Latency (execution time for each instruction) does not decrease
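As a rough illustration of the balanced versus unbalanced case, the sketch below reuses the stage times assumed earlier in the deck (200, 100, 200, 200, 100 ps). The variable names are illustrative.

```python
stage_ps = [200, 100, 200, 200, 100]  # IF, ID, EX, MEM, WB

nonpipelined = sum(stage_ps)             # 800 ps between instructions
balanced = nonpipelined / len(stage_ps)  # 160.0 ps if all stages took equal time
actual = max(stage_ps)                   # 200 ps: the slowest stage sets the clock

print(nonpipelined / balanced)  # 5.0, the number of stages
print(nonpipelined / actual)    # 4.0, less because the stages are not balanced
```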


Pipelining and ISA Design
- MIPS ISA designed for pipelining
  - All instructions are 32 bits
    - Easier to fetch and decode in 1st and 2nd stage
  - Few and regular instruction formats
    - Register fields stay at almost the same bit positions
  - Load/store addressing
    - MIPS does not allow operands to be used directly from memory. Operands are first loaded into registers.
  - Alignment of memory operands
    - Data can be transferred from memory to registers in a single data transfer command
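The fixed register positions can be illustrated by decoding two instruction words with the same shifts and masks. This sketch uses the standard MIPS field layout (opcode in bits 31-26, rs in 25-21, rt in 20-16, rd in 15-11); the two example encodings were assembled by hand for this illustration and are not from the slides.

```python
def fields(word):
    """Split a 32-bit MIPS instruction word into opcode and register fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,  # meaningful for R-format only
    }


add_word = 0x01098020  # add $s0, $t0, $t1  (R-format)
lw_word = 0x8D090000   # lw  $t1, 0($t0)    (I-format)

print(fields(add_word))  # rs=8 ($t0), rt=9 ($t1), rd=16 ($s0)
print(fields(lw_word))   # opcode=35, rs=8 ($t0), rt=9 ($t1)
```

Because rs and rt sit at the same bit positions in both formats, the register file can be read in the ID stage before the instruction type is fully decoded.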


80x86
- Instructions in 80x86 have variable length, from 1 byte to 17 bytes. This makes the first two stages, IF and ID, more challenging, which makes pipelining difficult.
- Due to variable instruction length in 80x86, the registers are specified at different bit positions.
- 80x86 allows direct operation on operands while in memory. An additional address stage is therefore needed in 80x86.


Pipelining Hazards
- Hazards occur when the next instruction in a pipelined program cannot be executed until the prior instruction has been executed.
- Structure hazards
  - A required resource is busy
- Data hazard
  - Need to wait for previous instruction to complete its data read/write
- Control hazard
  - Deciding on control action depends on previous instruction


Structure Hazards
- Conflict for use of a resource
- In MIPS pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch would have to stall for that cycle
    - Would cause a pipeline “bubble”
- Hence, pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches
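The single-memory conflict can be located with simple cycle arithmetic. The sketch below assumes a five-stage pipeline with no other stalls, one memory shared by IF and MEM, and a program given only as a list of opcode names; `fetch_conflicts` is a hypothetical helper.

```python
def fetch_conflicts(program):
    """Return (cycle, fetching index, memory-using index) for one shared memory."""
    conflicts = []
    for i, op in enumerate(program):
        if op in ("lw", "sw"):
            # Instruction i is fetched in cycle i + 1; MEM is three stages later.
            mem_cycle = i + 1 + 3
            # Index of the instruction whose IF falls in that same cycle.
            fetching = mem_cycle - 1
            if fetching < len(program):
                conflicts.append((mem_cycle, fetching, i))
    return conflicts


print(fetch_conflicts(["lw", "add", "sub", "and", "or"]))  # [(4, 3, 0)]
```

In this example the fourth instruction's fetch collides with the load's data access in cycle 4, which is the cycle where the fetch would have to stall.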


Structure Hazards
- Laundry analogy: A washer-dryer combo is used where a load of clothes is washed and then dried in the same machine.
- MIPS: A single memory unit used for data and instructions results in a structural hazard


Graphical Representation
- Shading in each block indicates what the element is used for in the instruction
- Shading on the left half of the block indicates the element is being written
- Shading on the right half of the block indicates that the element is being read


Data Hazards
- An instruction depends on completion of data access by a previous instruction
    add $s0, $t0, $t1
    sub $t2, $s0, $t3
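The dependence between the two instructions above can be found mechanically. This is a minimal sketch that handles only three-register ALU instructions and ignores later redefinitions of a register; the helper names are illustrative.

```python
import re


def regs(instr):
    """Return (destination, sources) for a three-register ALU instruction."""
    names = re.findall(r"\$\w+", instr)
    return names[0], names[1:]


def raw_dependences(program):
    """List (producer index, consumer index, register) read-after-write pairs."""
    found = []
    for i, producer in enumerate(program):
        dest, _ = regs(producer)
        for j in range(i + 1, len(program)):
            _, sources = regs(program[j])
            if dest in sources:
                found.append((i, j, dest))
    return found


program = ["add $s0, $t0, $t1", "sub $t2, $s0, $t3"]
print(raw_dependences(program))  # [(0, 1, '$s0')]
```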


Forwarding (aka Bypassing)
- Use result when it is computed
  - Don’t wait for it to be stored in a register
  - Requires extra connections in the datapath


Load-Use Data Hazard
- Can’t always avoid stalls by forwarding
  - If value not computed when needed
  - Can’t forward backward in time!
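The difference between an ALU result and a loaded value can be worked out from stage timing. The sketch below considers a consumer issued immediately after its producer, and assumes the register file is written in the first half of a cycle and read in the second half; the function and its arguments are illustrative.

```python
# Cycle in which each stage runs for an instruction fetched in cycle 1.
STAGE = {"IF": 1, "ID": 2, "EX": 3, "MEM": 4, "WB": 5}


def stalls(producer, forwarding):
    """Bubbles needed before a consumer issued right after `producer` ('alu' or 'lw')."""
    if forwarding:
        # ALU results exist after EX, loaded data only after MEM.
        ready_after = STAGE["EX"] if producer == "alu" else STAGE["MEM"]
        first_use = ready_after + 1   # forwarded value feeds the consumer's EX
        normal_use = STAGE["EX"] + 1  # consumer's EX cycle without any stall
    else:
        first_use = STAGE["WB"]       # consumer's ID must share the producer's WB cycle
        normal_use = STAGE["ID"] + 1  # consumer's ID cycle without any stall
    return max(0, first_use - normal_use)


print(stalls("alu", forwarding=False))  # 2
print(stalls("alu", forwarding=True))   # 0
print(stalls("lw", forwarding=True))    # 1
```

The one remaining bubble for `lw` is the load-use hazard: the data leaves memory one cycle after the next instruction's EX stage would need it.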


Code Scheduling to Avoid Stalls
- Reorder code to avoid use of load result in the next instruction
- C code for A = B + E; C = B + F;

Original order (13 cycles):
          lw  $t1, 0($t0)
          lw  $t2, 4($t0)
  stall   add $t3, $t1, $t2
          sw  $t3, 12($t0)
          lw  $t4, 8($t0)
  stall   add $t5, $t1, $t4
          sw  $t5, 16($t0)

Reordered (11 cycles):
          lw  $t1, 0($t0)
          lw  $t2, 4($t0)
          lw  $t4, 8($t0)
          add $t3, $t1, $t2
          sw  $t3, 12($t0)
          add $t5, $t1, $t4
          sw  $t5, 16($t0)
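The 13 and 11 cycle counts can be checked with a small counter. This sketch assumes a five-stage pipeline with forwarding, where the only stall is one bubble when an instruction uses the result of the `lw` directly before it; the parsing helper is illustrative and only covers the instruction forms used here.

```python
import re


def parse(instr):
    """Return (opcode, destination register or None, source registers)."""
    op = instr.split()[0]
    names = re.findall(r"\$\w+", instr)
    if op == "sw":
        return op, None, names  # sw reads both registers and writes none
    return op, names[0], names[1:]


def cycles(program, stages=5):
    """Total cycles: pipeline fill plus one bubble per load-use pair."""
    stall_count = 0
    for prev, cur in zip(program, program[1:]):
        p_op, p_dest, _ = parse(prev)
        _, _, c_srcs = parse(cur)
        if p_op == "lw" and p_dest in c_srcs:
            stall_count += 1
    return stages + len(program) - 1 + stall_count


original = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2",
            "sw $t3, 12($t0)", "lw $t4, 8($t0)", "add $t5, $t1, $t4",
            "sw $t5, 16($t0)"]
reordered = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)",
             "add $t3, $t1, $t2", "sw $t3, 12($t0)", "add $t5, $t1, $t4",
             "sw $t5, 16($t0)"]

print(cycles(original), cycles(reordered))  # 13 11
```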


Code Scheduling to Avoid Stalls

lw $t1,0($t0)

lw $t2,4($t0)

add $t3,$t1,$t2


Code Scheduling to Avoid Stalls

lw $t1,0($t0)

lw $t2,4($t0)

lw $t4,8($t0)

add $t3,$t1,$t2


Control Hazards
- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can’t always fetch correct instruction
    - Still working on ID stage of branch
- In MIPS pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add hardware to do it in ID stage


Stall on Branch
- Wait until branch outcome determined before fetching next instruction – slow!
- Adding extra hardware to determine the branch address – still stalled!

lw $3, 300($0)


Solution to Control Hazards
- Always predict that the branch will not be taken and keep executing the program
- Stall if branch is taken

Prediction correct
Prediction incorrect
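The cost of this prediction scheme can be estimated with a simple cycle count. The sketch assumes a five-stage pipeline, branches resolved in the ID stage (so a wrong prediction costs one bubble), and no other stalls. The instruction and branch counts in the example are made-up numbers for illustration only.

```python
def pipeline_cycles(n_instructions, stalled_branches, stages=5, penalty=1):
    """Cycles to run a program when each stalled branch costs `penalty` bubbles."""
    return stages + n_instructions - 1 + stalled_branches * penalty


# Hypothetical program: 100 instructions, 20 branches, 10 of them taken.
stall_on_every_branch = pipeline_cycles(100, stalled_branches=20)
predict_not_taken = pipeline_cycles(100, stalled_branches=10)

print(stall_on_every_branch, predict_not_taken)  # 124 114
```

With prediction, only the taken branches pay the one-cycle penalty, so the saving grows with the share of branches that fall through.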


Activity 3
Using the graphical representation, show that the following program has a pipeline hazard. Find a solution to avoid the pipeline stall.
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)


Pipeline Summary
- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation

