December 5th, 2004, Joseph Sharkey
Presented at the Fourth Workshop on Power-Aware Computer Systems (PACS’04)
http://www.cs.binghamton.edu/~lowpower
Reducing Delay and Power Consumption of the Wakeup Logic through Instruction
Packing and Tag Memoization
Joseph Sharkey, Dmitry Ponomarev, Kanad Ghose, Oguz Ergin*
State University of New York at Binghamton
* Currently with Intel Labs Barcelona, Spain
Presented By: Joseph Sharkey
Introduction
• Modern superscalar processors use dynamic instruction scheduling
• Dynamic schedulers are implemented in the form of issue queues
• Destination tags are broadcast every cycle and associatively matched against the locally stored source tags
– Large access delays
– Significant power consumption
Related Work
• Partition the queue into segments
– Power off unused segments
• Energy-efficient comparators
– Dissipate energy predominantly on a tag match
• Reduce the number of comparators used
Instruction Packing
• The Idea: pack multiple instructions into a single issue queue entry
– Significantly reduces the amount of CAM logic and the length of the tag busses
[Figure: tag broadcast across a traditional issue queue (one instruction per entry: I1, I2, ..., In) versus an issue queue with instruction packing (two instructions per entry: I1 and I2, I3 and ...)]
Instruction Packing
• Motivation:
– Not all instructions require comparators for two source operands
• Only 17% require both comparators!
– Traditional hardware designed for “Worst Case Scenario” – not complexity-effective
Instruction Packing
[Figure: bar chart, 0% to 100%, of the fraction of instructions with 0, 1, and 2 non-ready operands for gzip, vpr, gcc, mcf, parser, vortex, bzip2, twolf, wupwise, swim, mgrid, applu, mesa, art, equake, and Average]
Traditional Issue Queue Entry
• Entry allocated bit (A)
• Payload area (opcode, FU type, destination register tag, literals)
• Tag, comparator, and valid bit for each source (Tag CAM 1, Tag CAM 2, V1, V2)
• Ready bit (R)
Instruction Packing: Issue Queue Entry Format
• Left-half allocate bit (AL)
• Source tag and comparator (Tag CAM Left)
• Source valid left bit (SVL)
• Left payload area
Instruction Packing: Issue Queue Entry Format
• Entry allocated bit (A)
• Mode bit (Mode)
– 0: Multiple instructions in the entry
– 1: A single instruction in the entry
• Ready bit (R)
– Used only if the Mode bit is 1
Instruction Packing
[Figure: packed issue queue entry layout with Tag CAM Left, Tag CAM Right, A, MODE, AL, Left Payload Area, R, SVL, SVR, Right Payload Area, AR, and the MUX switch]
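Instruction Packing: Allocating an Instruction to a Half Entry
[Figure: the instruction ADD P71 P23 P49 (P71 = P23 + P49) is routed through the MUX switch into a half entry; tag 23 is held in the tag CAM, and ADD, 71, and 49 are held in the payload area]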
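Instruction Packing: Allocating an Instruction to a Full Entry
[Figure: the instruction SUB P72 P48 P52 (P72 = P48 - P52) is routed through the MUX switch into a full entry; tags 48 and 52 are held in the two tag CAMs, and SUB and 72 are held in the payload area]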
Entry Allocation
• Search the AL, AR, and A bits in parallel
• If the instruction contains at most one non-ready source:
– Allocate a “half” entry
• Otherwise:
– Allocate a “full” entry
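The allocation rule above can be sketched in software as follows. This is a minimal illustration, not the hardware described on the slide: the `Entry` class and `allocate` function are hypothetical names, the sequential scan stands in for the parallel search of the AL, AR, and A bits, and the left-before-right preference and the choice to set AL and AR for a full entry are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    a: bool = False   # entry allocated bit (A)
    al: bool = False  # left-half allocate bit (AL)
    ar: bool = False  # right-half allocate bit (AR)
    mode: int = 0     # 0: multiple instructions, 1: a single instruction


def allocate(queue: List[Entry], non_ready_sources: int) -> Optional[Tuple[int, str]]:
    """Return (entry index, 'left' | 'right' | 'full'), or None if nothing fits."""
    if non_ready_sources <= 1:
        # At most one non-ready source: a half entry is enough.
        for i, e in enumerate(queue):
            if e.a and e.mode == 1:
                continue  # entry is held by a full-entry instruction
            if not e.al:
                e.a, e.al, e.mode = True, True, 0
                return (i, "left")
            if not e.ar:
                e.a, e.ar, e.mode = True, True, 0
                return (i, "right")
        return None
    # Two non-ready sources: both comparators are needed, so take a whole entry.
    for i, e in enumerate(queue):
        if not e.a:
            e.a, e.al, e.ar, e.mode = True, True, True, 1
            return (i, "full")
    return None
```

With a two-entry queue, two instructions that each have at most one non-ready source share entry 0, and an instruction with two non-ready sources then takes entry 1 as a full entry.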
Instruction Wakeup
• Same as the traditional design
– Comparators set the corresponding valid bits on a match
• For a “full” entry:
– AND the SVL and SVR bits to set the R bit
• For a “half” entry:
– Simply set the associated SVL/SVR bit
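A behavioural sketch of the wakeup step, under stated assumptions: `Entry` and `wakeup` are hypothetical names, the broadcast destination tags of one cycle are modelled as a Python set, and a half that holds no tag is represented by `None`. The example tags 48 and 52 are the sources of the SUB instruction used elsewhere in the slides.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Entry:
    mode: int                        # 1: full entry, 0: packed half entries
    tag_left: Optional[int] = None   # source tag held in Tag CAM Left
    tag_right: Optional[int] = None  # source tag held in Tag CAM Right
    svl: bool = False                # source valid left bit (SVL)
    svr: bool = False                # source valid right bit (SVR)
    r: bool = False                  # ready bit (R), used only when mode == 1


def wakeup(entry: Entry, broadcast_tags: Set[int]) -> None:
    """Apply one cycle of destination tag broadcasts to a single entry."""
    # Each comparator sets its valid bit on a match, as in the traditional design.
    if entry.tag_left in broadcast_tags:
        entry.svl = True
    if entry.tag_right in broadcast_tags:
        entry.svr = True
    # A full entry is ready only when both sources are valid.
    if entry.mode == 1:
        entry.r = entry.svl and entry.svr
```

For a half entry nothing beyond the valid bit is needed, because the instruction in that half waits on at most one source.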
Instruction Selection
• “Half” entry
– Each half has its own select line, driven by the SVL and SVR bits, respectively
• “Full” entry
– The R bit drives the request on the “right-half” request line
– The “left-half” request line is gated off
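The request logic above reduces to a small truth function. The sketch below is illustrative only: `request_lines` is a hypothetical name, and qualifying each half's request with its allocate bit (AL or AR) is an assumption that the slide does not state.

```python
from typing import Tuple


def request_lines(mode: int, al: bool, ar: bool,
                  svl: bool, svr: bool, r: bool) -> Tuple[bool, bool]:
    """Return (left_request, right_request) for one issue queue entry."""
    if mode == 1:
        # Full entry: the left-half line is gated off and R drives the right-half line.
        return (False, r)
    # Half entries: each half requests selection from its own valid bit.
    return (al and svl, ar and svr)
```

Routing a full entry's request onto the right-half line means the select logic sees at most one request from an entry that holds a single instruction.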
Instruction Packing: Instruction Issue
[Figure: instruction issue. The half-entry instruction ADD P71 P23 P49 and the full-entry instruction SUB P72 P48 P52 are each read out through a MUX switch to the register file]
Benefits of Instruction Packing
• Area reduction
– Significant reduction in the amount of CAM logic
– Roughly the same amount of RAM (rearranged)
• Delay reduction
– Due to shorter tag broadcast lines and bit lines
• Power reduction
– Again due to shorter tag broadcast lines and bit lines
– Fewer comparators
• Minimal IPC degradation
– 83% of instructions require only a half entry
Tag Memoization
• Motivation:
– Higher-order bits of consecutive tag broadcasts are likely to be the same
• 35 → 0100011b
• 32 → 0100000b
• 41 → 0101001b
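The idea can be sketched for a single tag bus as follows. The function names, the 7-bit tag width, and the split into four upper and three lower bits are assumptions chosen to fit the example tags; the slides do not state where the split is made.

```python
from typing import Iterable, List, Tuple


def memoized_broadcasts(tags: Iterable[int], lower_bits: int = 3) -> List[Tuple[int, bool]]:
    """For each tag driven on one bus, report whether its upper bits must be driven."""
    last_upper = None  # upper bits most recently driven on this bus
    result = []
    for tag in tags:
        upper = tag >> lower_bits
        drive_upper = upper != last_upper  # skip the upper bits when they repeat
        result.append((tag, drive_upper))
        last_upper = upper
    return result


def bits_driven(tags: Iterable[int], tag_bits: int = 7, lower_bits: int = 3) -> int:
    """Count the tag bits actually driven; the lower bits are driven every time."""
    upper_bits = tag_bits - lower_bits
    return sum(lower_bits + (upper_bits if drive else 0)
               for _, drive in memoized_broadcasts(tags, lower_bits))
```

For the sequence 35, 32, 41, the upper four bits of 35 and 32 are both 0100, so the second broadcast drives only its lower bits. The order of the broadcasts is unchanged; only the number of bits driven differs.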
Tag Memoization
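[Figure: tag memoization circuit with labels U, L, D, S, and MUX and the signals drive_upper, clk, and match; successive animation frames annotate the circuit with the values 0 and 1]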
Tag Memoization
• Targets the width of tag broadcasts
– Nicely complements Instruction Packing, which reduces the length of the tag busses
• NO IMPACT ON IPC
– DOES NOT interrupt, hinder, or change the order of tag broadcasts
Instruction Packing: Results
[Figure: IPC (axis 0.000 to 1.600) of 32IQ and 16IQ_PACK for IntAvg, FPAvg, and Average]
• Less than 0.5% IPC degradation on average when packing a 32-entry queue into 16 entries
– Worst case: twolf, 4% IPC loss
– 12 benchmarks have less than 0.1% IPC loss
Instruction Packing: Area and Delay Reduction
• 26.7% reduction in issue queue area
• 21.6% reduction in issue queue delay
[Figure: CMOS layouts of a CAM cell and an SRAM bitcell]
Instruction Packing: Results
• 38% reduction in wakeup power
Tag Memoization: Results
Tag Memoization: Results Summary
• 12.9% power savings
• 16.4% power savings with intelligent bus arbitration
• Only 4% increase in delay
– Smaller comparators = faster
– Added delay of a NAND gate and a pass transistor
Combining Tag Memoization & Instruction Packing
• Combined 44.7% power savings
• 16% reduction in delay
• 0.5% performance degradation
[Figure: power savings (axis 0.00% to 50.00%) of Tag Memoization, Instruction Packing, and Memo + Pack for gzip, vpr, gcc, mcf, parser, vortex, bzip2, twolf, wupwise, swim, mgrid, applu, mesa, art, equake, IntAvg, FPAvg, and Average]
Conclusions
• We proposed two complementary techniques for reducing the energy dissipation of the instruction wakeup logic
• Instruction Packing reduces the length of the tag busses by sharing one issue queue entry between two instructions
• Tag Memoization avoids broadcasting the upper-order tag bits if they match those most recently driven on the same bus
• Major results
– Wakeup energy reduction: 44.7%
– Wakeup delay reduction: 16%
– IPC degradation: 0.5%
Questions/Comments?
http://www.cs.binghamton.edu/~lowpower
Joseph Sharkey
Tag Memoization: Delay Analysis
[Figure: tag memoization circuit (labels U, L, D, S, MUX; signals drive_upper, clk, match) with delay annotations]
– 55ps
– 18ps: turned-on transmission gate within the MUX
– 50ps LESS than a single long comparator