
CS 7810 Lecture 2

Date post: 12-Feb-2016
Page 1: CS 7810    Lecture 2

CS 7810 Lecture 2

Complexity-Effective Superscalar Processors

S. Palacharla, N.P. Jouppi, J.E. Smith

U. Wisconsin, WRL

ISCA ’97

Page 2: CS 7810    Lecture 2

Complexity-Effective

• Conflict between clock speed and parallelism

• Goals of the paper:

Characterize complexity as a function of issue width, window size, and feature size

Propose clustered microarchitecture that allows fast clocks with high parallelism

Page 3: CS 7810    Lecture 2

Current Trends (circa 1997)

• More functional units, large in-flight windows

• Impact on cycle-time critical structures: register renaming, instruction wake-up, instruction selection, result bypass, register files, caches

Page 4: CS 7810    Lecture 2

Wire Delay Trends

• Logic delays scale linearly with feature size

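• Wire delay ~ RC = $R_m \times C_m \times L^2$

• $R_m = \rho / (\text{width} \times \text{thickness})$

• $C_m = 2 \times \epsilon \times \epsilon_0 \times (\text{thickness}/\text{width} + \text{width}/\text{thickness})$

• $R_m \sim S$ ; $C_m \sim S$ ; $L \sim 1/S$ (gate size scaled by 1/S)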

• Hence, delay across 50K gates is constant in ps and is linear with S in terms of FO4
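
The scaling argument on this slide can be checked numerically. The sketch below is only an illustration: the function names and the base values (arbitrary units) are assumptions, and the constant factor in the RC delay is dropped, as on the slide.

```python
def wire_delay(r_m, c_m, length):
    """Distributed RC wire delay up to a constant factor: Rm * Cm * L^2."""
    return r_m * c_m * length ** 2


def scaled_wire_delay(r_m, c_m, length, s):
    """Apply the slide's scaling rules: Rm ~ S, Cm ~ S, L ~ 1/S."""
    return wire_delay(r_m * s, c_m * s, length / s)


def delay_in_fo4(delay, fo4_base, s):
    """Express a delay in FO4 units, with the FO4 logic delay shrinking as 1/S."""
    return delay / (fo4_base / s)


# Hypothetical base values in arbitrary units, not taken from the paper.
R_M, C_M, LENGTH, FO4 = 2.0, 3.0, 10.0, 5.0
for s in (1.0, 2.0, 4.0):
    d = scaled_wire_delay(R_M, C_M, LENGTH, s)
    # The absolute delay stays at 600.0; the delay in FO4 doubles with S.
    print(s, d, delay_in_fo4(d, FO4, s))
```

The factors of S in Rm and Cm cancel against the 1/S² from L², so the absolute delay is unchanged, while the same delay measured in (shrinking) FO4 units grows linearly with S.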

Page 5: CS 7810    Lecture 2

Update on Wire Delays

• “The Future of Wires”, Ho, Mai, Horowitz, 2001

• Cm actually decreases with reduced feature widths

• Hence, wire delay across 50K gates (in FO4) increases only slightly and is not quite linear with S – uses repeaters

• Wire delays are still a problem (though, not as bad as Palacharla et al. claim) – also note, FO4s/clock is shrinking

Page 6: CS 7810    Lecture 2

Update on Wire Delays

[Figure from “The Future of Wires”, Ho, Mai, Horowitz]

Page 7: CS 7810    Lecture 2

Register Rename Logic

[Figure: register rename logic. Blocks: map table, dependence check logic, mux, free pool. Signals: logical source regs and logical dest regs in; physical source regs and physical dest regs out.]

Page 8: CS 7810    Lecture 2

Map Table – RAM

Phys reg id: 7 bits per entry

Num entries = Num logical regs

Shadow copies (shift register)
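
As a rough software analogy for the RAM-style map table, the sketch below renames one group of instructions. It is a simplification under stated assumptions: `rename_group`, the dict-based table, and the register names are hypothetical, and shadow copies for branch recovery are not modeled.

```python
from collections import deque


def rename_group(group, map_table, free_pool):
    """Rename one group of instructions in program order.

    group: list of (dest, src1, src2) logical register names.
    map_table: dict mapping logical register -> physical register.
    free_pool: deque of unused physical registers.
    """
    renamed = []
    for dest, src1, src2 in group:
        # Sources read the current mapping. Because the table is updated in
        # program order inside this loop, a source written by an earlier
        # instruction of the same group sees the new mapping. In hardware
        # this is the job of the dependence check logic and the mux.
        p_src1, p_src2 = map_table[src1], map_table[src2]
        p_dest = free_pool.popleft()   # allocate a free physical register
        map_table[dest] = p_dest
        renamed.append((p_dest, p_src1, p_src2))
    return renamed


map_table = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_pool = deque(["p4", "p5", "p6", "p7"])
group = [("r3", "r1", "r2"), ("r1", "r3", "r2")]
print(rename_group(group, map_table, free_pool))
# [('p4', 'p1', 'p2'), ('p5', 'p4', 'p2')]
```

The second instruction reads r3 as p4, the register just allocated by the first, which is the intra-group dependence case.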

Page 9: CS 7810    Lecture 2

Map Table – CAM

Logical reg id: 5 bits per entry, plus a 1-bit valid flag

Num entries = Num phys regs

Shadow copies: 1 bit

Page 10: CS 7810    Lecture 2

Delay Model

Cell wire length = C + 3 × IW

Delay = RC = $c_0 + c_1 \times IW + c_2 \times IW^2$

Rename delay ~ IW

The wire delay component increases as we shrink to 0.18 µm

Problems:

• They assume that wire delay/λ (in ns) remains constant.

• No window size?

Page 11: CS 7810    Lecture 2

Wakeup Logic

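[Figure: wakeup logic. The result tags tag1 … tagIW are broadcast to every window entry; each entry holds tagL and tagR, compares (=) them against the broadcast tags, and ORs the matches to set the rdyL and rdyR flags.]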

Page 12: CS 7810    Lecture 2

Wakeup Logic

• CAM array wire length ~ issue width x winsize

• Capacitive load ~ winsize

• Matchline length ~ issue width

• Issue width has a greater impact on delay as it influences tagdrive and tagmatch (the quadratic components are not very dominant)

• For smaller features, the wire delays dominate
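
The wakeup operation can be sketched in software as a tag broadcast followed by per-entry comparisons. This is a behavioral illustration only: the class and function names are assumptions (the field names follow the tagL/tagR/rdyL/rdyR labels of the wakeup figure), and it says nothing about circuit delay.

```python
from dataclasses import dataclass


@dataclass
class WindowEntry:
    tag_l: str            # tag of the left source operand
    tag_r: str            # tag of the right source operand
    rdy_l: bool = False
    rdy_r: bool = False

    @property
    def ready(self):
        return self.rdy_l and self.rdy_r


def wakeup(window, result_tags):
    """Broadcast up to IW result tags to every window entry.

    Each entry compares both operand tags against every broadcast tag
    (2 x IW comparisons) and ORs the matches into its ready flags.
    Returns the entries whose operands are now all ready.
    """
    for entry in window:
        entry.rdy_l = entry.rdy_l or any(t == entry.tag_l for t in result_tags)
        entry.rdy_r = entry.rdy_r or any(t == entry.tag_r for t in result_tags)
    return [entry for entry in window if entry.ready]


window = [WindowEntry("p4", "p2", rdy_r=True), WindowEntry("p5", "p6")]
woken = wakeup(window, ["p4", "p5"])
print(len(woken))   # 1: only the first entry has both operands ready
```

The loop over all entries corresponds to the tag lines spanning the whole window, and the loop over `result_tags` corresponds to the per-entry comparators, one set per issue slot.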

Page 13: CS 7810    Lecture 2

Selection Logic

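[Figure: selection logic. Issue window entries send req to, and receive grant from, a tree of arbiter cells; each arbiter cell passes anyreq to the next level and receives enable from it.]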

Page 14: CS 7810    Lecture 2

Selection Logic

• Multiple FUs are handled by having more stages in series – further increases selection logic delay

• Delay ~ log(WINSIZE)

• Wire lengths ~ WINSIZE, but are ignored – hence, delay scales very well with feature size
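
The logarithmic delay comes from the tree structure of the arbiter. A minimal sketch, assuming 4-input arbiter cells and a fixed lowest-index-first priority (both are assumptions for illustration; `select` is a hypothetical name):

```python
def select(requests, fan_in=4):
    """Grant one requesting entry using a tree of arbiter cells.

    requests: list of booleans, one per issue window entry.
    Returns (granted index or None, number of tree levels).
    """
    # Upward pass: each cell ORs its inputs into an anyreq signal.
    levels = [list(requests)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([any(below[i:i + fan_in])
                       for i in range(0, len(below), fan_in)])
    depth = len(levels) - 1
    if not any(levels[-1]):
        return None, depth
    # Downward pass: each enabled cell grants its first requesting input.
    index = 0
    for below in reversed(levels[:-1]):
        group = below[index * fan_in:(index + 1) * fan_in]
        index = index * fan_in + group.index(True)
    return index, depth


requests = [False] * 64
requests[37] = True
requests[50] = True
print(select(requests))   # (37, 3): three levels of 4-input cells cover 64 entries
```

A request travels up `depth` levels and the grant travels back down the same number, so the number of cells on the critical path grows with the logarithm of the window size.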

Page 15: CS 7810    Lecture 2

Bypass Delay

• The number of bypass paths equals $2 \times IW^2 \times S$ (S is the number of pipeline stages)

• Wire length ~ IW, hence, delay ~ $IW^2$

• The layout and pipeline depth (capacitive load) also matter
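
To get a feel for the path count, the formula can be evaluated directly. The helper name is hypothetical, and the reading of the factor 2 as two source operands per functional unit is an assumption.

```python
def bypass_paths(issue_width, pipe_stages):
    """Number of bypass paths: 2 x IW^2 x S.

    Assumed reading: each of the IW x S in-flight results may be forwarded
    to either of 2 operand inputs of each of the IW functional units.
    """
    return 2 * issue_width ** 2 * pipe_stages


for iw in (4, 8):
    print(iw, bypass_paths(iw, pipe_stages=1))
# 4 32
# 8 128
```

Doubling the issue width quadruples the number of paths, on top of the longer wires each path has to drive.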

Page 16: CS 7810    Lecture 2

Summary of Results

Technology   Issue width   Window size   Rename delay (ps)   Wakeup + Select (ps)   Bypass delay (ps)
0.8 µm       4             32            1577.9              2903.7                 184.9
0.8 µm       8             64            1710.5              3369.4                 1056.4
0.35 µm      4             32            627.2               1248.4                 184.9
0.35 µm      8             64            726.6               1484.8                 1056.4
0.18 µm      4             32            351.0               578.0                  184.9
0.18 µm      8             64            427.9               724.0                  1056.4
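
A short script makes the trend in these numbers easier to see. It uses the delays from the Summary of Results slide; the dictionary layout and the `critical_structure` helper are hypothetical.

```python
# (technology, issue width, window size) -> delays in ps, from the summary slide
DELAYS = {
    ("0.8um", 4, 32): {"rename": 1577.9, "wakeup_select": 2903.7, "bypass": 184.9},
    ("0.8um", 8, 64): {"rename": 1710.5, "wakeup_select": 3369.4, "bypass": 1056.4},
    ("0.35um", 4, 32): {"rename": 627.2, "wakeup_select": 1248.4, "bypass": 184.9},
    ("0.35um", 8, 64): {"rename": 726.6, "wakeup_select": 1484.8, "bypass": 1056.4},
    ("0.18um", 4, 32): {"rename": 351.0, "wakeup_select": 578.0, "bypass": 184.9},
    ("0.18um", 8, 64): {"rename": 427.9, "wakeup_select": 724.0, "bypass": 1056.4},
}


def critical_structure(tech, issue_width, window_size):
    """Return the name of the structure with the longest delay."""
    delays = DELAYS[(tech, issue_width, window_size)]
    return max(delays, key=delays.get)


for config in DELAYS:
    print(config, critical_structure(*config))
```

Wakeup + select is the longest delay in every configuration except the 8-wide machine at 0.18 µm, where the bypass delay, which does not shrink with feature size in this table, overtakes it.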

Page 17: CS 7810    Lecture 2

Bottlenecks

• Wakeup+Select and Bypass have the longest delays and represent atomic operations

• Pipelining these structures will prevent dependent instructions from executing back-to-back (in consecutive cycles)

• Increased issue width / window size / wire delays exacerbate the problem (also for the register file and cache)

Page 18: CS 7810    Lecture 2

Dependence-Based Microarchitecture

Instruction stream:

r3 ← r1 + r2
r4 ← r3 + r2
r5 ← r4 + r2
r6 ← r4 + r2
r7 ← r6 + r2
r8 ← r5 + r2
r9 ← r1 + r2

Rdy operands table: r1 = 1, r2 = 1, r3 = 0, …

FIFOs: empty

Page 19: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1: r3 ← r1 + r2

Page 20: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1 (head first): r3 ← r1 + r2; r4 ← r3 + r2

Page 21: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1 (head first): r3 ← r1 + r2; r4 ← r3 + r2; r5 ← r4 + r2

Page 22: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1 (head first): r3 ← r1 + r2; r4 ← r3 + r2; r5 ← r4 + r2

FIFO 2: r6 ← r4 + r2

Page 23: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1 (head first): r3 ← r1 + r2; r4 ← r3 + r2; r5 ← r4 + r2

FIFO 2 (head first): r6 ← r4 + r2; r7 ← r6 + r2

Page 24: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1 (head first): r3 ← r1 + r2; r4 ← r3 + r2; r5 ← r4 + r2; r8 ← r5 + r2

FIFO 2 (head first): r6 ← r4 + r2; r7 ← r6 + r2

Page 25: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1 (head first): r3 ← r1 + r2; r4 ← r3 + r2; r5 ← r4 + r2; r8 ← r5 + r2

FIFO 2 (head first): r6 ← r4 + r2; r7 ← r6 + r2

FIFO 3: r9 ← r1 + r2

Page 26: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1 (head first): r3 ← r1 + r2; r4 ← r3 + r2; r5 ← r4 + r2; r8 ← r5 + r2

FIFO 2 (head first): r6 ← r4 + r2; r7 ← r6 + r2

FIFO 3: r9 ← r1 + r2

Rdy operands: r1, r2

Page 27: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1 (head first): r4 ← r3 + r2; r5 ← r4 + r2; r8 ← r5 + r2

FIFO 2 (head first): r6 ← r4 + r2; r7 ← r6 + r2

Rdy operands table: r1 = 1, r2 = 1, r3 = 1, …

Rdy operands: r3, r9

Page 28: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1 (head first): r5 ← r4 + r2; r8 ← r5 + r2

FIFO 2 (head first): r6 ← r4 + r2; r7 ← r6 + r2

Rdy operands: r4

Page 29: CS 7810    Lecture 2

Dependence-Based Microarchitecture

FIFO 1: r8 ← r5 + r2

FIFO 2: r7 ← r6 + r2

Rdy operands: r5, r6
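
The walkthrough on the preceding slides can be reproduced with a small simulation. This is a simplified sketch, not the paper's exact heuristic: it assumes an instruction is steered behind its producer only when that producer sits at a FIFO tail, otherwise to an empty FIFO; it assumes single-cycle latency and no limit on how many heads issue per cycle. The function names are hypothetical.

```python
def steer(program, num_fifos):
    """Steer (dest, src1, src2) instructions into FIFOs by dependence."""
    fifos = [[] for _ in range(num_fifos)]
    for inst in program:
        dest, *srcs = inst
        # Prefer a FIFO whose tail instruction produces one of the sources.
        target = next((f for f in fifos if f and f[-1][0] in srcs), None)
        if target is None:
            # Otherwise start a new dependence chain in an empty FIFO.
            target = next((f for f in fifos if not f), None)
        if target is None:
            raise RuntimeError("no FIFO available; a real front end would stall")
        target.append(inst)
    return fifos


def issue(fifos, ready):
    """Issue from FIFO heads only; returns the dests issued in each cycle.

    Note: this empties the FIFOs it is given.
    """
    ready = set(ready)
    schedule = []
    while any(fifos):
        heads = [f for f in fifos if f and all(s in ready for s in f[0][1:])]
        if not heads:
            raise RuntimeError("no FIFO head has all operands ready")
        issued = [f.pop(0)[0] for f in heads]
        ready.update(issued)   # results are visible from the next cycle on
        schedule.append(issued)
    return schedule


program = [("r3", "r1", "r2"), ("r4", "r3", "r2"), ("r5", "r4", "r2"),
           ("r6", "r4", "r2"), ("r7", "r6", "r2"), ("r8", "r5", "r2"),
           ("r9", "r1", "r2")]
fifos = steer(program, num_fifos=4)
print([[inst[0] for inst in f] for f in fifos])
# [['r3', 'r4', 'r5', 'r8'], ['r6', 'r7'], ['r9'], []]
print(issue(fifos, ready={"r1", "r2"}))
# [['r3', 'r9'], ['r4'], ['r5', 'r6'], ['r8', 'r7']]
```

Only the FIFO heads are examined each cycle, which is why wakeup and select operate on a handful of entries instead of the whole window.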

Page 30: CS 7810    Lecture 2

Pros and Cons

• Wakeup and select over a subset of issue queue entries (only FIFO heads)

• Under-utilization as FIFOs do not get filled (causes about 5% IPC loss) – but it is not hard to increase their sizes

• You still need an operand-rdy table

Page 31: CS 7810    Lecture 2

Clustered Microarchitectures

Page 32: CS 7810    Lecture 2

Clustered Microarchitectures

• Simplifies wakeup+select and bypassing

• Dependence-based, hence most communication is local

• Low porting requirements on register file, issue queue

• IPC loss of 6.3%, but a clock speed improvement

Page 33: CS 7810    Lecture 2

Conclusions

• As issue width and window size increase, the delays of most structures go up dramatically

• Dominant wire delays exacerbate the problem

• Hence, to support large widths, build smaller cores that communicate with each other

• With dependence information, it is possible to minimize communication costs

Page 34: CS 7810    Lecture 2

Next Class’ Paper

• “Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures”, ISCA’00

• Do not get bogged down in details & methodology

