Out-of-Order Execution Structures

Date post: 12-Feb-2016
Page 1: Out-of-Order Execution Structures

A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Out-of-Order Execution Structures

Page 2: Out-of-Order Execution Structures

MIPS R10000-Like Design

• Based on:
  – Complexity-Effective Superscalar Processors
  – S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97

Page 3: Out-of-Order Execution Structures

Fetch Phase

• Fetch:
  – Read instructions from I-Cache
  – Predict Branches
  – Pass on to Decode phase

Page 4: Out-of-Order Execution Structures

Decode Phase

• Decode:
  – Parse instruction
  – Shuffle opcode parts to appropriate ports for rename

Page 5: Out-of-Order Execution Structures

Renaming Phase

• Rename:
  – Map Architectural registers to Physical
  – Eliminate False Dependences
  – Passes renamed instructions to scheduler
    • Called Dispatch

Page 6: Out-of-Order Execution Structures

Scheduling Phase

• Wakeup:
  – Instructions check whether they become ready
  – From Writeback: physical register names
• Select:
  – Amongst the ready select those to execute
  – Structural hazards
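
The two scheduling steps can be sketched in a few lines of Python. This is only an illustrative model, not the circuit: the `WindowEntry` class, the string register names and the oldest-first pick order are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WindowEntry:
    name: str
    waiting_on: set  # physical source registers that are not yet available


def wakeup(window, broadcast_tags):
    """Clear every outstanding source tag that matches a broadcast register name."""
    tags = set(broadcast_tags)
    for entry in window:
        entry.waiting_on -= tags


def select(window, free_units):
    """Among the ready entries, pick at most free_units (a structural limit)."""
    ready = [entry for entry in window if not entry.waiting_on]
    return ready[:free_units]


window = [
    WindowEntry("add", {"p7"}),
    WindowEntry("sub", {"p7", "p9"}),
    WindowEntry("mul", set()),
]
wakeup(window, ["p7"])                      # writeback broadcasts p7
issued = [e.name for e in select(window, free_units=1)]
```

With one free unit, both `add` and `mul` are ready after the broadcast but only the first in window order is issued; `sub` keeps waiting on `p9`.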

Page 7: Out-of-Order Execution Structures

Register File Read Phase

• Read source operands

Page 8: Out-of-Order Execution Structures

Bypass and Execute Phase

Page 9: Out-of-Order Execution Structures

Data Cache Access Phase

Page 10: Out-of-Order Execution Structures

Writeback Phase

• Write result to register file
• Broadcast tag in order to wake up waiting instructions
  – Notice that the tag broadcast should happen TWO cycles in advance of the result production

Page 11: Out-of-Order Execution Structures

Reservation Station Model

• Used by Pentium Pro, PowerPC 604
• Re-order buffer holds values
• Renaming points to re-order buffer entries
  – Tomasulo-like

Page 12: Out-of-Order Execution Structures

Physical Register File vs. Reservation Station

• Physical Register File:
  – Values reside in the register file
  – At writeback instructions broadcast the register name
• Reservation Stations:
  – Values reside:
    – In the register file upon commit
      • Non-speculative
    – In reservation stations prior to commit
      • Speculative

Page 13: Out-of-Order Execution Structures

Quantifying Complexity

• Critical Path Delay as a function of architectural parameters
  – Instruction Window size (WinSize)
  – Issue Width (IW)
• Full-custom Implementations
  – Study the critical path
  – Delay model
  – Extrapolate how it will scale with “future” technologies

Page 14: Out-of-Order Execution Structures

Renaming

• Inputs:
  – IW instructions
  – Up to 2 x Input register names
  – Up to 1 x Output register name
• Outputs:
  – 2 x input physical registers
  – 1 x new output physical register
  – 1 x previous physical register name for checkpointing
  – Updated rename table
• Superscalar Issue complicates things a bit
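
For a single instruction, the inputs and outputs listed above map onto three table reads and one table write. The sketch below models the rename table as a Python dict and the free list as a plain list; both are assumptions for illustration, as are the register names.

```python
def rename(rat, free_list, src1, src2, dest):
    """Rename one instruction.

    rat maps architectural register name -> physical register name.
    Returns (phys_src1, phys_src2, new_phys_dest, old_phys_dest).
    """
    ps1 = rat[src1]             # read: first source mapping
    ps2 = rat[src2]             # read: second source mapping
    old_pd = rat[dest]          # read: previous mapping, kept for recovery
    new_pd = free_list.pop(0)   # take a new physical register from the free list
    rat[dest] = new_pd          # write: update the rename table
    return ps1, ps2, new_pd, old_pd


rat = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_list = ["p32", "p33"]
result = rename(rat, free_list, "r1", "r2", "r1")   # r1 <- r1 op r2
```

Note that the sources are read before the destination is overwritten, so an instruction that reads and writes the same architectural register still sees the old mapping for its source.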

Page 15: Out-of-Order Execution Structures

Renaming One Instruction

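[Figure: the source register names s1 and s2 and the destination name d index the RAT (entries p0 to p31) through three read ports, giving the physical s1 and s2 and the old mapping of d, which is kept for misspeculation recovery. A new register from the free list is written into the RAT through one write port.]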

Page 16: Out-of-Order Execution Structures

Renaming Two Instructions

[Figure: two instructions, each with s1, s2, d and new d, access the RAT together. Cross-bundle dependency check logic chooses the outputs ps1, ps2, old d and new d for each instruction.]

Page 17: Out-of-Order Execution Structures

Renaming More Instructions

• Dependency checking logic for instruction i must match against all preceding destinations
• If there are multiple matches it must enforce priority:
  – Pick the one closest to this instruction
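
The priority rule can be illustrated with a small software model of renaming a whole group at once. This is a sketch under assumptions: the dict-based table, the `(src1, src2, dest)` tuple layout and the function name are invented for the example, and hardware would perform the comparisons in parallel rather than in a loop.

```python
def rename_group(rat, free_list, group):
    """Rename a group of (src1, src2, dest) instructions given in program order."""
    snapshot = dict(rat)                            # table contents before the group
    new_dests = [free_list.pop(0) for _ in group]   # one new register per instruction

    def lookup(i, reg):
        # Compare against preceding destinations, closest instruction first.
        for j in range(i - 1, -1, -1):
            if group[j][2] == reg:
                return new_dests[j]
        return snapshot[reg]                        # no match: use the table value

    renamed = []
    for i, (s1, s2, d) in enumerate(group):
        renamed.append((lookup(i, s1), lookup(i, s2), new_dests[i], lookup(i, d)))

    for (_, _, d), pd in zip(group, new_dests):     # the last writer of a register wins
        rat[d] = pd
    return renamed


rat = {"r1": "p1", "r2": "p2", "r3": "p3", "r4": "p4"}
free_list = ["p10", "p11", "p12"]
group = [("r1", "r2", "r4"), ("r4", "r1", "r4"), ("r4", "r4", "r3")]
out = rename_group(rat, free_list, group)
```

The third instruction reads r4, which both earlier instructions write; it receives `p11`, the mapping created by the closer of the two.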

Page 18: Out-of-Order Execution Structures

RAT: SRAM Implementation

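[Figure: SRAM-based RAT with a decoder, SRAM cells, bitlines and sense amps. The architectural register number indexes the array and the physical register number is read out. The array has #ARCH REGS rows, each lg(#PHYS REGS) bits wide.]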

Page 19: Out-of-Order Execution Structures

SRAM RAT cell

Page 20: Out-of-Order Execution Structures

RAT: CAM Implementation

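[Figure: CAM-based RAT with CAM cells and an encoder. The architectural register number is matched against the entries and the matching entry gives the physical register number. The array has #PHYS REGS entries, each lg(#ARCH REGS) bits wide plus an active bit.]

• One CAM entry per physical register
• Active bit indicates the current map
• New version by setting active bit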

Page 21: Out-of-Order Execution Structures

CAM Cell

[Figure: CAM cell schematic with Match, Wordline, Bitline and Bitline_B signals.]

Page 22: Out-of-Order Execution Structures

SRAM vs. CAM

• SRAM:
  – Arch reg rows
  – Lg(phy reg) cols
  – SRAM read/write
• CAM:
  – Phy reg rows
  – Lg(arch reg) cols
  – CAM match
  – Update:
    • Reset previous valid bit
    • Set current valid bit
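
The two organizations can be contrasted with a pair of toy classes. These are behavioural sketches only: the class names, the identity mapping at reset and the sequential search standing in for a parallel CAM match are all assumptions.

```python
class SramRat:
    """One row per architectural register; the row stores a physical register number."""

    def __init__(self, num_arch):
        self.rows = list(range(num_arch))   # assumed reset state: arch i -> phys i

    def lookup(self, arch):
        return self.rows[arch]              # indexed read

    def update(self, arch, phys):
        self.rows[arch] = phys              # indexed write


class CamRat:
    """One entry per physical register: an architectural register name plus a valid bit."""

    def __init__(self, num_arch, num_phys):
        self.arch = [None] * num_phys
        self.valid = [False] * num_phys
        for r in range(num_arch):           # assumed reset state: arch i -> phys i
            self.arch[r] = r
            self.valid[r] = True

    def lookup(self, arch):
        # A real CAM compares all entries at once; a loop models the match here.
        for phys, name in enumerate(self.arch):
            if self.valid[phys] and name == arch:
                return phys
        raise KeyError(arch)

    def update(self, arch, phys):
        self.valid[self.lookup(arch)] = False   # reset previous valid bit
        self.arch[phys] = arch
        self.valid[phys] = True                 # set current valid bit


sram, cam = SramRat(4), CamRat(4, 8)
sram.update(2, 6)
cam.update(2, 6)
```

Both tables answer `lookup(2)` with 6 after the update; the CAM additionally keeps the stale entry for physical register 2, marked invalid.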

Page 23: Out-of-Order Execution Structures

Scheduler: Part #1 - Wakeup

Page 24: Out-of-Order Execution Structures

Scheduler: Part #2 - Select

For a Single FU

[Figure: tree of arbiters, with REQ signals entering the tree and GRANT signals leaving it.]

• Anyreq raised if any req is active
• Grant issued if arbiter enabled
• Root enabled if FU available
• Location based select policy
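
A recursive model of the arbiter tree may help make the signal flow concrete. The sketch assumes arbiter cells with a fan-in of 4 and a lowest-position-first priority; neither detail is stated on the slide, and the function name is made up.

```python
def arbiter_select(requests, enable=True, fan_in=4):
    """Return a grant list with at most one True, for a single functional unit."""
    n = len(requests)
    if n <= fan_in:
        # Leaf arbiter: grant the first active request, but only if enabled.
        grants = [False] * n
        if enable:
            for i, req in enumerate(requests):
                if req:
                    grants[i] = True
                    break
        return grants

    size = -(-n // fan_in)                               # ceiling division
    groups = [requests[i:i + size] for i in range(0, n, size)]
    anyreq = [any(g) for g in groups]                    # anyreq signals go up
    group_enable = arbiter_select(anyreq, enable, fan_in)
    grants = []
    for g, en in zip(groups, group_enable):              # enables/grants come down
        grants.extend(arbiter_select(g, en, fan_in))
    return grants


reqs = [False] * 16
reqs[5] = reqs[11] = True
grants = arbiter_select(reqs, enable=True)      # FU available
blocked = arbiter_select(reqs, enable=False)    # FU busy: root not enabled
```

Because priority depends only on position in the window, entry 5 wins over entry 11 regardless of instruction age.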

Page 25: Out-of-Order Execution Structures

Select for More than One FU

• Handling multiple FUs of the same type:
  – Stack select logic blocks in series (hierarchy)
  – Mask the request granted to the previous unit
• NOT feasible for more than 2 FUs
• Alternative:
  – Statically partition issue window among FUs
  – MIPS R10000, HP PA 8000
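
The stacked scheme amounts to running one select block per FU and hiding each granted request from the next block. A minimal sketch, with hypothetical function names and a simple first-ready pick standing in for each select block:

```python
def first_ready(requests):
    """Stand-in for one select block: index of the first active request, or None."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None


def select_stacked(requests, num_units):
    """Chain one select block per FU, masking each grant before the next block."""
    remaining = list(requests)
    granted = []
    for _ in range(num_units):
        pick = first_ready(remaining)
        if pick is None:
            break
        granted.append(pick)
        remaining[pick] = False     # mask the request granted to the previous unit
    return granted


picks = select_stacked([False, True, True, False, True], num_units=2)
```

Each added unit puts another select block on the serial path, which is consistent with the slide's remark that this does not extend beyond two FUs.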

Page 26: Out-of-Order Execution Structures

Datapath and Bypass

Commonly Used Layout: 1 Bit-Slice

• Turn on Tri-State A to pass result of FU1 to left operand of FU0

Page 27: Out-of-Order Execution Structures

Complexity Analysis

• Critical path delay as a function of:
  – Issue Width
  – Window Size
• Register Renaming Table
• Wakeup and Select
• Bypass paths

Page 28: Out-of-Order Execution Structures

Methodology

• A representative CMOS design is selected from published alternatives
• Implemented the circuits for 3 technologies:
  – 0.8 micron, 0.35 micron and 0.18 micron
• Optimize for speed
• Wire parasitics in delay model
  – Rmetal, Cmetal

Page 29: Out-of-Order Execution Structures

Methodology

• Feature size scaling: 1 / S
• Voltage scaling: 1 / U
• Logic Delay = (CL x V) / I
• Capacitive load: CL = 1 → 1 / S
• Supply voltage: V = 1 → 1 / U
• Average charge/discharge current: I = 1 → 1 / U
• So, Logic Delay = (1 / S x 1 / U) / (1 / U) = 1 / S
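
The cancellation of the voltage factor can be checked numerically. The function name and the sample values of S and U below are arbitrary choices for illustration.

```python
def logic_delay_scale(s, u):
    """Relative logic delay after scaling feature size by 1/S and voltage by 1/U."""
    c_load = 1.0 / s      # capacitive load scales with feature size
    voltage = 1.0 / u     # supply voltage scales with 1/U
    current = 1.0 / u     # average charge/discharge current scales with 1/U
    return (c_load * voltage) / current


scale_a = logic_delay_scale(s=2.0, u=1.5)
scale_b = logic_delay_scale(s=2.0, u=3.0)   # different U, same S
```

Both calls give a factor of about 0.5: under these assumptions the delay depends on S only.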

Page 30: Out-of-Order Execution Structures

Wire Delay

• Wire delay = 0.5 x Rmetal x Cmetal x L^2
• L: wire length
• Intrinsic RC delay
• Rmetal: resistance per unit length
• Cmetal: capacitance per unit length
• 0.5: 1st order approximation of distributed RC model – uniformly distributed R & C

Page 31: Out-of-Order Execution Structures

Wire Delay Scaling

• Metal thickness doesn’t scale much
  – Width ~ 1/S
  – Rmetal ~ S
• Fringe capacitance dominates in smaller feature sizes – edges to parallel wires and the substrate
• Parallel plate – scales with 1 / S
  – Cmetal ~ S
• Length scales with 1/S
• Overall scale factor: S x S x (1/S)^2 = 1
• Wire delay remains constant
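
The scale-factor arithmetic can be verified with the first-order distributed RC expression, delay = 0.5 x Rmetal x Cmetal x L^2. The function names and the numeric inputs are placeholders, not values from the study.

```python
def wire_delay(r_metal, c_metal, length):
    """First-order distributed RC delay; r_metal and c_metal are per unit length."""
    return 0.5 * r_metal * c_metal * length ** 2


def scaled_wire_delay(r_metal, c_metal, length, s):
    """Apply the scaling above: Rmetal x S, Cmetal x S, length x 1/S."""
    return wire_delay(r_metal * s, c_metal * s, length / s)


base = wire_delay(2.0, 3.0, 4.0)
shrunk = scaled_wire_delay(2.0, 3.0, 4.0, s=2.0)
```

The two results agree, while a logic gate under the same scaling would speed up by 1/S, so wires take a growing share of the cycle as feature size shrinks.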

Page 32: Out-of-Order Execution Structures

Register Renaming Table

Page 33: Out-of-Order Execution Structures

Dependency Checking Logic

• Accessed in parallel with map table
• Every logical reg compared against logical dest regs of current rename group
• For IW=2,4,8, delay less than map table

[Figure: example rename group using logical registers r1 and r4.]

Page 34: Out-of-Order Execution Structures

Renaming Delay

• SRAM scheme
• Delay components:
  – Time to decode the arch reg index
  – Time to drive wordline
  – Time to pull down bit line
  – Time for SenseAmp to detect pull-down
  – MUX time ignored as control from dep. check logic comes in advance

Page 35: Out-of-Order Execution Structures

Renaming Circuit

Page 36: Out-of-Order Execution Structures

Decoder Delay

Page 37: Out-of-Order Execution Structures

Decoder Delay

• Predecoding for speed
• Length of predecode lines:
  – Cellheight: height of single cell excluding wordlines
  – Wordline spacing
• NVREG: # of virtual reg-s
• x3: 3-operand instr-s

Page 38: Out-of-Order Execution Structures

Decoder Delay

• Tnand: fall delay of NAND
• Tnor: rise delay of NOR
• Rnandpd: NAND pull-down channel resistance + predecode line metal resistance
• Ceq: diffusion cap. of NAND + gate cap. of NOR + interconnect cap.

Page 39: Out-of-Order Execution Structures

Decoder Delay

• Substituting predecode line length, Req and Ceq we get:
• c2: intrinsic RC delay of predecode line
• c2 very small
• Decoder delay ~linearly dependent on IW
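
The effect of a small c2 can be illustrated numerically. The transcript does not preserve the delay expression, so the sketch assumes it is a quadratic in issue width, c0 + c1 x IW + c2 x IW^2, and uses made-up coefficients; only the qualitative behaviour is meant to carry over.

```python
def component_delay(iw, c0, c1, c2):
    """Assumed form of a rename delay component as a function of issue width."""
    return c0 + c1 * iw + c2 * iw ** 2


# Hypothetical coefficients, chosen only so that c2 is small relative to c1.
C0, C1, C2 = 100.0, 10.0, 0.05

delays = {iw: component_delay(iw, C0, C1, C2) for iw in (2, 4, 8)}
quadratic_share = {iw: (C2 * iw ** 2) / delays[iw] for iw in delays}
```

With these placeholder numbers the quadratic term stays below 2% of the total even at IW = 8, which is the sense in which the delay is approximately linear in IW.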

Page 40: Out-of-Order Execution Structures

Rename Delay

• Wordline:
• c2: intrinsic RC delay of wordline
• c2 very small
• Wordline delay ~linearly dependent on IW

Page 41: Out-of-Order Execution Structures

Rename Delay

• Bitline:
• c2 very small
• Bitline delay ~linearly dependent on IW
• SenseAmp delay ~linearly dependent on IW

Page 42: Out-of-Order Execution Structures

Rename Logic Delay Scaling

• Feature size - [increase in bitline & wordline delay with increasing IW]
• 0.8um: IW 2 → 8, bitline delay +37%
• 0.18um: IW 2 → 8, bitline delay +53%

• Total delay increases linearly with IW

• Each Component shows linear increase with IW

• Bitline Delay > Wordline Delay

• Bitline length ~ # of Logical reg-s

• Wordline length ~ width of physical reg designator

IW impact on delay worsens with decreasing feature size

Page 43: Out-of-Order Execution Structures

Wakeup Delay

• Critical path: a mismatch pulls the ready signal low
• Delay components:
  – Tag drivers drive tag lines (vertical)
  – Mismatched bit: pull-down stack pulls match line low (horizontal)
  – Final OR gate ORs all the match lines of an operand tag
• Ttagdrive ~ driver pull-up R & tag line length & tag line load C
• Quadratic component significant for IW>2 & 0.18um

Page 44: Out-of-Order Execution Structures

Wakeup Delay

• Quadratic component small for both cases
• Both delays ~linearly dependent on IW

Page 45: Out-of-Order Execution Structures

Wakeup Delay: IW and Window Size

• 0.18um process
• Quadratic dependence
• Issue width has greater effect: increasing it increases all 3 delay components
• As IW and WinSize increase together, delay follows the curve marked on the plot

Page 46: Out-of-Order Execution Structures

Wakeup Delay: Window Size

• 8-way & 0.18um process
• Tag drive delay increases rapidly as WinSize increases
• Match OR delay constant

Page 47: Out-of-Order Execution Structures

Wakeup Delay: Feature size

• 8-way & 64-entry window
• Tag drive and tag match delays do not scale as well as Match OR delay
  – Match OR is logic delay
  – Others also have wire delays

Page 48: Out-of-Order Execution Structures

Selection Logic and Bypass Delay

• Selection
  – Logarithmically dependent on WinSize
• Bypass: delay dependent on (IW)^2
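
The two growth rates can be compared with a rough model. It assumes a select tree built from 4-input arbiters and a bypass wire whose length grows in proportion to IW, with delay given by a 0.5 x R x C x L^2 wire model; these assumptions and all names below are illustrative, not taken from the slide.

```python
def select_tree_levels(win_size, fan_in=4):
    """Number of arbiter levels needed to cover win_size window entries."""
    levels, span = 0, 1
    while span < win_size:
        span *= fan_in
        levels += 1
    return levels


def bypass_wire_delay(iw, unit_length=1.0, r=1.0, c=1.0):
    """Wire delay of a result bus whose length is assumed proportional to IW."""
    length = iw * unit_length
    return 0.5 * r * c * length ** 2


levels = {w: select_tree_levels(w) for w in (16, 64, 128)}
bypass_ratio = bypass_wire_delay(8) / bypass_wire_delay(4)
```

Quadrupling the window from 16 to 64 entries adds a single arbiter level, whereas doubling the issue width quadruples the modelled bypass delay.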

