JR.S00 1
Lecture 2: Review of Performance/Cost/Power Metrics and Architectural Basics
Prof. Jan M. Rabaey, Computer Science 252
Spring 2000, “Computer Architecture in Cory Hall”
Review of Lecture 1
• Class Organization
– Class Projects
• Trends in the Industry and Driving Forces
Computer Architecture Topics
[Figure: single-processor topics, from ISA down to technology]
• Instruction Set Architecture: addressing, protection, exception handling
• Pipelining and Instruction Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, VLIW, DSP, reconfiguration
• Memory Hierarchy: L1 cache, L2 cache, DRAM; coherence, bandwidth, latency; interleaving, bus protocols, emerging technologies
• Input/Output and Storage: disks, WORM, tape, RAID
• VLSI
Computer Architecture Topics
[Figure: multiprocessor topics; processor-memory (P-M) pairs connected by an interconnection network]
• Networks and Interconnections: topologies, routing, bandwidth, latency, reliability; network interfaces
• Multiprocessors: shared memory, message passing, data parallelism; processor-memory-switch organization
The Secret of Architecture Design: Measurement and Evaluation
Architecture design is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: design cycle; creativity yields good, mediocre, and bad ideas; cost/performance analysis filters them; design and analysis repeat]
Computer Engineering Methodology
[Figure: methodology cycle]
• Evaluate existing systems for bottlenecks (analysis, driven by benchmarks and workloads)
• Simulate new designs and organizations (design, guided by technology trends)
• Implement next-generation system (implementation, with complexity analysis)
Measurement Tools
• Hardware: cost, delay, area, power estimation
• Benchmarks, traces, mixes
• Simulation (many levels)
– ISA, RT, gate, circuit
• Queuing theory
• Rules of thumb
• Fundamental “laws”/principles
Review: Performance, Cost, Power
Metric 1: Performance
• Time to run the task
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …
– Throughput, bandwidth

Plane        Speed      DC to Paris   Passengers   Throughput (passenger-mi/hr)
Boeing 747   610 mph    6.5 hours     470          286,700
Concorde     1350 mph   3 hours       132          178,200
The Performance Metric
“X is n times faster than Y” means

    ExTime(Y)   Performance(X)
    --------- = --------------
    ExTime(X)   Performance(Y)

• Speed of Concorde vs. Boeing 747
• Throughput of Boeing 747 vs. Concorde
Amdahl's Law
Speedup due to enhancement E:

    Speedup(E) = ExTime w/o E / ExTime w/ E = Performance w/ E / Performance w/o E

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.
Amdahl’s Law

    ExTime_new = ExTime_old × [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Amdahl’s Law
• Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

    ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old

    Speedup_overall = 1 / 0.95 = 1.053

Law of diminishing returns: focus on the common case!
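The FP example is easy to check in code. A minimal sketch (the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up
    by a given factor and the rest of the task is unaffected."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# FP example from the slide: 10% of the time sped up 2x
print(round(amdahl_speedup(0.10, 2.0), 3))  # -> 1.053
```

Note that even an infinite `speedup_enhanced` caps the overall speedup at 1/(1 - fraction_enhanced), which is the "diminishing returns" point.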
Metrics of Performance
[Figure: levels of abstraction and the metrics that apply at each]
• Application: answers per month, operations per second
• Programming language / Compiler / ISA: MIPS (millions of instructions per second), MFLOP/s (millions of FP operations per second)
• Datapath, control, function units: megabytes per second, cycles per second (clock rate)
• Transistors, wires, pins
Aspects of CPU Performance

    CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

                Inst Count   CPI   Clock Rate
Program             X
Compiler            X        (X)
Inst. Set           X         X
Organization                  X        X
Technology                             X
Cycles Per Instruction

“Average cycles per instruction”:

    CPI = Cycles / Instruction Count = (CPU Time × Clock Rate) / Instruction Count

    CPU time = CycleTime × Σ (i = 1..n) CPI_i × I_i

With “instruction frequency” F_i = I_i / Instruction Count:

    CPI = Σ (i = 1..n) CPI_i × F_i

Invest resources where time is spent!
Example: Calculating CPI
Base machine (reg/reg), typical mix:

Op       Freq   CPI_i   CPI_i×F_i   (% Time)
ALU      50%    1       0.5         (33%)
Load     20%    2       0.4         (27%)
Store    10%    2       0.2         (13%)
Branch   20%    2       0.4         (27%)
                Total CPI: 1.5
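The frequency-weighted CPI sum above can be sketched directly (the dictionary layout is mine):

```python
# Instruction mix: class -> (frequency, cycles per instruction)
mix = {"ALU": (0.50, 1), "Load": (0.20, 2),
       "Store": (0.10, 2), "Branch": (0.20, 2)}

# CPI = sum of CPI_i * F_i over all instruction classes
cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(cpi, 2))  # -> 1.5

# Fraction of execution time per class: CPI_i * F_i / CPI
for op, (freq, cycles) in mix.items():
    print(op, round(freq * cycles / cpi, 2))
```

The time fractions it prints (0.33, 0.27, 0.13, 0.27) match the "% Time" column: loads and branches take a larger share of time than of instructions because they cost 2 cycles each.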
Creating Benchmark Sets
• Real programs
• Kernels
• Toy benchmarks
• Synthetic benchmarks
– e.g., Whetstone and Dhrystone
SPEC: System Performance Evaluation Cooperative
• First round, 1989
– 10 programs yielding a single number (“SPECmarks”)
• Second round, 1992
– SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
» Compiler flags unlimited. March 93 DEC 4000 Model 610 examples:
    spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)=memcpy(b,a,c)”
    wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third round, 1995
– New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating-point)
– “Benchmarks useful for 3 years”
– Single flag setting for all programs: SPECint_base95, SPECfp_base95
How to Summarize Performance
• Arithmetic mean (weighted arithmetic mean) tracks execution time: Σ(Ti)/n or Σ(Wi×Ti)
• Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n/Σ(1/Ri) or 1/Σ(Wi/Ri)
• Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10)
– Arithmetic mean impacted by choice of reference machine
• Use the geometric mean for comparison: (∏ Ti)^(1/n)
– Independent of chosen machine
– But not a good metric for total execution time
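The three means can be compared on a toy set of execution times (the numbers are made-up illustrations):

```python
import math

times = [2.0, 4.0, 8.0]           # hypothetical execution times (seconds)
rates = [1.0 / t for t in times]  # corresponding rates (tasks/second)
n = len(times)

arith_mean = sum(times) / n                   # tracks total execution time
harm_rate  = n / sum(1.0 / r for r in rates)  # harmonic mean of the rates
geo_mean   = math.prod(times) ** (1.0 / n)    # for normalized comparisons

print(arith_mean, harm_rate, geo_mean)
```

The harmonic mean of the rates is exactly the reciprocal of the arithmetic mean of the times, which is why it is the right way to average rates like MFLOPS; the geometric mean is unaffected by which machine the times are normalized to, but says nothing about total running time.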
SPEC First Round
• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically
[Figure: SPEC89 ratios (0 to 800) on the IBM Powerstation 550 for 2 different compilers, per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]
Impact of Means on SPECmark89 for IBM 550 (without and with special compiler option)

              Ratio to VAX:       Time:             Weighted Time:
Program       Before   After     Before   After    Before   After
gcc             30       29        49       51       8.91     9.22
espresso        35       34        65       67       7.64     7.86
spice           47       47       510      510       5.69     5.69
doduc           46       49        41       38       5.81     5.45
nasa7           78      144       258      140       3.43     1.86
li              34       34       183      183       7.86     7.86
eqntott         40       40        28       28       6.68     6.68
matrix300       78      730        58        6       3.43     0.37
fpppp           90       87        34       35       2.97     3.07
tomcatv         33      138        20       19       2.01     1.94
Mean            54       72       124      108      54.42    49.99

Geometric ratio: 1.33    Arithmetic ratio: 1.16    Weighted arithmetic ratio: 1.09
Performance Evaluation
• “For better or worse, benchmarks shape a field”
• Good products are created when you have:
– Good benchmarks
– Good ways to summarize performance
• Given that sales are in part a function of performance relative to the competition, vendors invest in improving the product as reported by the performance summary
• If the benchmarks/summary are inadequate, then vendors must choose between improving the product for real programs and improving the product to get more sales; sales almost always win!
• Execution time is the measure of computer performance!
Integrated Circuits Costs

    Dies per wafer = π × (Wafer_diam/2)² / Die_Area − π × Wafer_diam / √(2 × Die_Area) − Test_Die

    Die yield = Wafer_yield × (1 + Defect_Density × Die_Area / α)^(−α)

    Die cost = Wafer cost / (Dies per wafer × Die yield)

    IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Die cost goes roughly with (die area)^4
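The die-cost model above can be sketched as follows. The wafer diameter, α, and test-die count below are illustrative assumptions (the table on the next slide does not state them), so the output is a ballpark figure, not a reproduction of the published numbers:

```python
import math

def dies_per_wafer(wafer_diam_cm, die_area_cm2, test_dies=0):
    """Gross dies on a round wafer, minus edge loss and test sites."""
    return (math.pi * (wafer_diam_cm / 2) ** 2 / die_area_cm2
            - math.pi * wafer_diam_cm / math.sqrt(2 * die_area_cm2)
            - test_dies)

def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha=3.0):
    """Negative-binomial yield model from the slide."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** -alpha

def die_cost(wafer_cost, wafer_diam_cm, die_area_cm2,
             defects_per_cm2, alpha=3.0, wafer_yield=1.0):
    n = dies_per_wafer(wafer_diam_cm, die_area_cm2)
    y = die_yield(wafer_yield, defects_per_cm2, die_area_cm2, alpha)
    return wafer_cost / (n * y)

# 486DX2-style inputs: $1200 wafer, 81 mm^2 die, 1.0 defect/cm^2;
# a 15 cm (6-inch) wafer is assumed here.
print(round(die_cost(1200, 15.0, 0.81, 1.0), 2))
```

Because yield falls and dies per wafer shrink as the die grows, cost rises much faster than area, which is the "(die area)^4" rule of thumb.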
Real World Examples

Chip          Metal    Line     Wafer   Defect   Area   Dies/   Yield   Die
              layers   width    cost    /cm²     mm²    wafer           cost
386DX           2      0.90      $900    1.0      43     360    71%      $4
486DX2          3      0.80     $1200    1.0      81     181    54%     $12
PowerPC 601     4      0.80     $1700    1.3     121     115    28%     $53
HP PA 7100      3      0.80     $1300    1.0     196      66    27%     $73
DEC Alpha       3      0.70     $1500    1.2     234      53    19%    $149
SuperSPARC      3      0.70     $1700    1.6     256      48    13%    $272
Pentium         3      0.80     $1500    1.5     296      40     9%    $417

– From “Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15
Cost/Performance: What is the Relationship of Cost to Price?
• Recurring costs
– Component costs
– Direct costs (add 25% to 40% of recurring costs): labor, purchasing, scrap, warranty
• Non-recurring costs or gross margin (add 82% to 186%): R&D, equipment maintenance, rental, marketing, sales, financing cost, pretax profits, taxes
• Average discount to get list price (add 33% to 66%): volume discounts and/or retailer markup
[Figure: breakdown of list price; component cost 15% to 33%, direct cost 6% to 8%, gross margin 34% to 39%, average discount 25% to 40%; component cost + direct cost + gross margin = average selling price]
Chip Prices (August 1993)
• Assume purchase of 10,000 units

Chip           Area    Mfg.    Price    Multi-   Comment
               mm²     cost             plier
386DX            43      $9      $31     3.4     Intense competition
486DX2           81     $35     $245     7.0     No competition
PowerPC 601     121     $77     $280     3.6
DEC Alpha       234    $202    $1231     6.1     Recoup R&D?
Pentium         296    $473     $965     2.0     Early in shipments
Summary: Price vs. Cost
[Figure: stacked bar charts breaking price into component costs, direct costs, gross margin, and average discount for minicomputers, workstations, and PCs; price-to-component-cost ratios of roughly 4.7 / 3.5 (Mini), 3.8 / 2.5 (W/S), and 1.8 / 1.5 (PC)]
Power/Energy
[Figure: maximum power (Watts, log scale 1 to 100) vs. process generation (1.5µ down to 0.13µ) for Intel processors: 386, 486, Pentium(R), Pentium(R) MMX, Pentium Pro(R), Pentium II(R). Source: Intel]
• Lead processor power increases every generation
• Compactions provide higher performance at lower power
Energy/Power
• Power dissipation: rate at which energy is taken from the supply (power source) and transformed into heat

    P = E/t

• Energy dissipation for a given instruction depends upon the type of instruction (and the state of the processor):

    P = (1/CPU Time) × Σ (i = 1..n) E_i × I_i
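The per-instruction energy formula can be sketched over an instruction mix. All the numbers below (energies, counts, CPU time) are made-up illustrative values, not measurements:

```python
# Hypothetical per-instruction energies (nJ) and dynamic instruction counts
energies_nj = {"alu": 0.5, "load": 1.2, "store": 1.1, "branch": 0.6}
counts = {"alu": 50_000_000, "load": 20_000_000,
          "store": 10_000_000, "branch": 20_000_000}
cpu_time_s = 0.05  # seconds

# P = (1/CPU time) * sum over classes of E_i * I_i
total_energy_j = sum(counts[k] * energies_nj[k] for k in counts) * 1e-9
power_w = total_energy_j / cpu_time_s
print(round(power_w, 3))
```

Note that energy (total joules drawn) and power (joules per second, hence heat) are distinct metrics: shortening CPU time at fixed total energy raises power even though the battery drain is unchanged.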
The Energy-Flexibility Gap
[Figure: energy efficiency (MOPS/mW or MIPS/mW, log scale 0.1 to 1000) vs. flexibility (coverage)]
• Embedded processors (e.g., SA1100): 0.4 MIPS/mW
• ASIPs, DSPs (2 V DSP): 3 MOPS/mW
• Reconfigurable processor/logic (Pleiades): 10-80 MOPS/mW
• Dedicated HW: most efficient, least flexible
Summary, #1
• Designing to last through trends:

          Capacity         Speed
Logic     2x in 3 years    2x in 3 years
DRAM      4x in 3 years    2x in 10 years
Disk      4x in 3 years    2x in 10 years
SPEC rating: 2x in 1.5 years

• 6 yrs to graduate => 16X CPU speed, DRAM/Disk size
• Time to run the task
– Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, …
– Throughput, bandwidth
• “X is n times faster than Y” means

    ExTime(Y)   Performance(X)
    --------- = --------------
    ExTime(X)   Performance(Y)
Summary, #2
• Amdahl’s Law:

    Speedup_overall = ExTime_old / ExTime_new
                    = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

• CPI Law:

    CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)

• Execution time is the REAL measure of computer performance!
• Good products are created when you have:
– Good benchmarks, good ways to summarize performance
• A different set of metrics applies to embedded systems
Review: Instruction Sets, Pipelines, and Caches
Computer Architecture Is …
“the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation.”
– Amdahl, Blaauw, and Brooks, 1964
Computer Architecture’s Changing Definition
• 1950s to 1960s: Computer Architecture Course = Computer Arithmetic
• 1970s to mid 1980s: Computer Architecture Course = Instruction Set Design, especially ISA appropriate for compilers
• 1990s: Computer Architecture Course = Design of CPU, memory system, I/O system, Multiprocessors
Computer Architecture is ...
[Figure: three nested layers, from outermost to innermost: Instruction Set Architecture, Organization, Hardware]
Instruction Set Architecture (ISA)
[Figure: the instruction set as the interface between software above and hardware below]
Interface Design
A good interface:
• Lasts through many implementations (portability, compatibility)
• Is used in many different ways (generality)
• Provides convenient functionality to higher levels
• Permits an efficient implementation at lower levels
[Figure: over time, one interface serves many uses above and many implementations below]
Evolution of Instruction Sets
Single Accumulator (EDSAC 1950)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model from Implementation:
High-level Language Based (B5000 1963); Concept of a Family (IBM 360 1964)
General Purpose Register Machines:
Complex Instruction Sets (Vax, Intel 432 1977-80); Load/Store Architecture (CDC 6600, Cray 1 1963-76)
RISC (Mips, Sparc, HP-PA, IBM RS6000, PowerPC . . . 1987)
LIW/”EPIC”? (IA-64 . . . 1999)
Evolution of Instruction Sets
• Major advances in computer architecture are typically associated with landmark instruction set designs
– Ex: Stack vs. GPR (System 360)
• Design decisions must take into account:
– Technology
– Machine organization
– Programming languages
– Compiler technology
– Operating systems
– Applications
• And they in turn influence these
A "Typical" RISC
• 32-bit fixed-format instructions (3 formats: I, R, J)
• 32 32-bit GPRs (R0 contains zero; DP takes a pair)
• 3-address, reg-reg arithmetic instructions
• Single addressing mode for load/store: base + displacement
– No indirection
• Simple branch conditions (based on register values)
• Delayed branch

See: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Example: MIPS (DLX)

Register-Register:
    [31..26 Op] [25..21 Rs1] [20..16 Rs2] [15..11 Rd] [10..0 Opx]
Register-Immediate:
    [31..26 Op] [25..21 Rs1] [20..16 Rd] [15..0 immediate]
Branch:
    [31..26 Op] [25..21 Rs1] [20..16 Rs2/Opx] [15..0 immediate]
Jump / Call:
    [31..26 Op] [25..0 target]
Pipelining: It’s Natural!
• Laundry example: Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight; tasks A, B, C, D run back-to-back, each taking 30 + 40 + 20 minutes]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start Work ASAP
[Figure: timeline from 6 PM; washing of each load overlaps drying and folding of earlier loads; stage intervals 30, 40, 40, 40, 40, 20 minutes]
• Pipelined laundry takes 3.5 hours for 4 loads
Pipelining Lessons
• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline for tasks A-D, 6 PM to 9:30 PM]
Computer Pipelines
• Execute billions of instructions, so throughput is what matters
• DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores
5 Steps of DLX Datapath (Figure 3.1, Page 130)
[Figure: unpipelined datapath with five steps: Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, Write Back. Elements include Next PC logic, adder (+4), instruction memory, register file (RS1, RS2, RD), sign extend of the immediate, zero test, ALU with muxes, data memory, and the LMD register feeding WB data]
5 Steps of DLX Datapath (Figure 3.4, Page 134)
[Figure: the same five-stage datapath with pipeline latches IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages; Next SEQ PC and RD are carried forward through the latches]
• Data stationary control
– Local decode for each instruction phase / pipeline stage
Visualizing Pipelining (Figure 3.3, Page 133)
[Figure: pipeline diagram over clock cycles 1-7; each instruction passes through Ifetch, Reg, ALU, DMem, Reg, offset one cycle from the previous instruction in program order]
It’s Not That Easy for Computers
• Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
– Structural hazards: HW cannot support this combination of instructions (two dogs fighting for the same bone)
– Data hazards: instruction depends on result of prior instruction still in the pipeline
– Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
One Memory Port/Structural Hazards (Figure 3.6, Page 142)
[Figure: pipeline diagram for Load followed by Instr 1-4; in cycle 4, Load’s DMem access and Instr 3’s Ifetch both need the single memory port]
One Memory Port/Structural Hazards (Figure 3.7, Page 143)
[Figure: same sequence with a bubble inserted; Instr 3 stalls one cycle so its Ifetch no longer conflicts with Load’s DMem access]
Speed Up Equation for Pipelining

    CPI_pipelined = Ideal CPI + Average stall cycles per instruction

    Speedup = (Ideal CPI × Pipeline depth) / (Ideal CPI + Pipeline stall CPI)
              × (CycleTime_unpipelined / CycleTime_pipelined)

For simple RISC pipeline, Ideal CPI = 1:

    Speedup = Pipeline depth / (1 + Pipeline stall CPI)
              × (CycleTime_unpipelined / CycleTime_pipelined)
Example: Dual-port vs. Single-port
• Machine A: dual-ported memory (“Harvard architecture”)
• Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed

    SpeedUp_A = Pipeline Depth / (1 + 0) × (clock_unpipe / clock_pipe)
              = Pipeline Depth

    SpeedUp_B = Pipeline Depth / (1 + 0.4 × 1) × (clock_unpipe / (clock_unpipe / 1.05))
              = (Pipeline Depth / 1.4) × 1.05
              = 0.75 × Pipeline Depth

    SpeedUp_A / SpeedUp_B = Pipeline Depth / (0.75 × Pipeline Depth) = 1.33

• Machine A is 1.33 times faster
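The example can be checked with the pipeline speedup equation (the function is a sketch; the depth of 5 is arbitrary and cancels in the ratio):

```python
def pipeline_speedup(depth, stall_cpi, cycle_unpipe=1.0, cycle_pipe=1.0):
    """Speedup over an unpipelined machine: depth/(1+stall CPI) * cycle ratio."""
    return depth / (1.0 + stall_cpi) * (cycle_unpipe / cycle_pipe)

depth = 5
a = pipeline_speedup(depth, 0.0)                          # dual-ported: no stalls
b = pipeline_speedup(depth, 0.4 * 1,                      # 40% loads, 1-cycle stall
                     cycle_pipe=1 / 1.05)                 # but a 1.05x faster clock
print(round(a / b, 2))  # -> 1.33
```

The 5% clock advantage of machine B cannot make up for stalling on 40% of instructions, which is why the dual-ported design wins despite its slower clock.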
Data Hazard on R1 (Figure 3.9, Page 147)
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) for the sequence
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
The sub, and, and or instructions read r1 before add writes it back]
Three Generic Data Hazards
• Read After Write (RAW): Instr_J tries to read operand before Instr_I writes it

    I: add r1,r2,r3
    J: sub r4,r1,r3

• Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
• Write After Read (WAR): Instr_J writes operand before Instr_I reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in DLX 5-stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Three Generic Data Hazards
• Write After Write (WAW): Instr_J writes operand before Instr_I writes it

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in DLX 5-stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later, more complicated pipes
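The three hazard classes reduce to simple set tests on register names. A minimal sketch (the representation of an instruction as a destination plus a set of sources is mine, not a real decoder):

```python
def classify_hazards(instr_i, instr_j):
    """Data hazards when instr_i precedes instr_j in program order.
    Each instruction is (dest_reg, {source_regs})."""
    di, si = instr_i
    dj, sj = instr_j
    hazards = []
    if di is not None and di in sj:
        hazards.append("RAW")   # J reads what I writes
    if dj is not None and dj in si:
        hazards.append("WAR")   # J writes what I reads
    if di is not None and di == dj:
        hazards.append("WAW")   # both write the same register
    return hazards

# add r1,r2,r3 then sub r4,r1,r3: a true dependence on r1
print(classify_hazards(("r1", {"r2", "r3"}), ("r4", {"r1", "r3"})))  # -> ['RAW']
```

Only RAW reflects a real flow of data; WAR and WAW are name conflicts that renaming can remove, which is why they matter only in pipelines more aggressive than the 5-stage DLX.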
Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
[Figure: the same add/sub/and/or/xor sequence; ALU results are forwarded from the EX/MEM and MEM/WB latches to the ALU inputs of dependent instructions, avoiding stalls]
HW Change for Forwarding (Figure 3.20, Page 161)
[Figure: forwarding datapath; muxes at the ALU inputs select among the ID/EX register values, the EX/MEM result, and the MEM/WB result from data memory]
Data Hazard Even with Forwarding (Figure 3.12, Page 153)
[Figure: pipeline diagram for
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
The load’s data is not available until the end of MEM, too late for sub’s EX stage even with forwarding]
Data Hazard Even with Forwarding (Figure 3.13, Page 154)
[Figure: the same sequence with a one-cycle bubble inserted after the load; forwarding then satisfies sub, and, and or]
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.

Slow code:              Fast code:
    LW  Rb,b                LW  Rb,b
    LW  Rc,c                LW  Rc,c
    ADD Ra,Rb,Rc            LW  Re,e
    SW  a,Ra                ADD Ra,Rb,Rc
    LW  Re,e                LW  Rf,f
    LW  Rf,f                SW  a,Ra
    SUB Rd,Re,Rf            SUB Rd,Re,Rf
    SW  d,Rd                SW  d,Rd
Control Hazard on Branches: Three Stage Stall
[Figure: pipeline diagram for
    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11
The three instructions after the branch are fetched before the branch outcome is known]
Branch Stall Impact
• If CPI = 1 and 30% of instructions are branches with a 3-cycle stall => new CPI = 1.9!
• Two-part solution:
– Determine branch taken or not sooner, AND
– Compute taken-branch address earlier
• DLX branch tests if register = 0 or ≠ 0
• DLX solution:
– Move zero test to ID/RF stage
– Adder to calculate new PC in ID/RF stage
– 1 clock cycle penalty for branch versus 3
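The CPI impact of branch stalls follows directly from the CPI law. A minimal sketch of the arithmetic (the function name is mine):

```python
def cpi_with_branches(base_cpi, branch_freq, penalty_cycles):
    """Effective CPI when a fraction of instructions stall the pipeline."""
    return base_cpi + branch_freq * penalty_cycles

print(round(cpi_with_branches(1.0, 0.30, 3), 2))  # -> 1.9 (3-cycle stall)
print(round(cpi_with_branches(1.0, 0.30, 1), 2))  # -> 1.3 (zero test in ID/RF)
```

Moving the zero test and target adder into ID/RF cuts the effective CPI penalty from 0.9 to 0.3 cycles per instruction for this mix.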
Pipelined DLX Datapath (Figure 3.22, page 163)
[Figure: datapath with the zero test and the branch-target adder moved into the Instruction Decode/Register Fetch stage; this is the correct 1-cycle-latency implementation]
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict branch not taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% of DLX branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict branch taken
– 53% of DLX branches taken on average
– But haven’t calculated branch target address in DLX
» DLX still incurs 1-cycle branch penalty
» Other machines: branch target known before outcome
Four Branch Hazard Alternatives
#4: Delayed branch
– Define branch to take place AFTER a following instruction

    branch instruction
    sequential successor_1
    sequential successor_2      (branch delay of length n)
    ........
    sequential successor_n
    branch target if taken

– 1-slot delay allows proper decision and branch target address in 5-stage pipeline
– DLX uses this
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before the branch instruction
– From the target address: only valuable when branch taken
– From fall-through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
Evaluating Branch Alternatives

    Pipeline speedup = Pipeline depth / (1 + Branch frequency × Branch penalty)

Scheduling          Branch    CPI    Speedup vs.    Speedup vs.
scheme              penalty          unpipelined    stall
Stall pipeline        3       1.42       3.5           1.0
Predict taken         1       1.14       4.4           1.26
Predict not taken     1       1.09       4.5           1.29
Delayed branch        0.5     1.07       4.6           1.31

Conditional & unconditional branches = 14% of instructions; 65% of them change the PC
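The table's CPI column can be re-derived from the stated branch statistics. This is one reading of the slide's arithmetic (a 5-stage pipeline is assumed, and "predict not taken" is charged its penalty only on the 65% of branches that change the PC); small differences from the table come from rounding:

```python
depth, branch_freq, taken_frac = 5, 0.14, 0.65

# Effective branch penalty (cycles) per scheme
schemes = {
    "Stall pipeline":    3.0,
    "Predict taken":     1.0,
    "Predict not taken": 1.0 * taken_frac,
    "Delayed branch":    0.5,
}

cpi = {name: 1.0 + branch_freq * pen for name, pen in schemes.items()}
for name in schemes:
    vs_unpipelined = depth / cpi[name]
    vs_stall = cpi["Stall pipeline"] / cpi[name]
    print(f"{name:18s} CPI={cpi[name]:.2f} "
          f"vs-unpipe={vs_unpipelined:.1f} vs-stall={vs_stall:.2f}")
```

The ordering is the point: even the crude delayed-branch scheme recovers most of the loss of a full 3-cycle stall, which is why DLX adopted it.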
Summary: Control and Pipelining
• Just overlap tasks; easy if tasks are independent
• Speedup ≤ Pipeline depth; if ideal CPI is 1, then:

    Speedup = Pipeline depth / (1 + Pipeline stall CPI)
              × (CycleTime_unpipelined / CycleTime_pipelined)

• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW, WAR, WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction