Trace-Level Reuse
A. González, J. Tubella and C. Molina
Dpt. d´Arquitectura de Computadors
Universitat Politècnica de Catalunya
1999 International Conference on Parallel Processing ICPP´99
September 21, 1999
Motivation
Increase performance by overcoming the dataflow limitation
– Data speculation: exploits the predictability of values
– Data reuse: exploits the redundancy of computations
Motivation
Redundant computations are rather frequent
– Code: loops, recursive subroutines
– Data: finite domain of values
The results could be reused instead of recomputed
[Figure: a dynamic execution stream in which the computation OUT = f(IN) recurs; the repeated instances are marked as redundant computations]
Motivation
Reuse granularity
– An instruction
– A sequence of instructions: TRACE-LEVEL REUSE
Performance potential of data reuse
– At instruction level
– At trace level
Outline
Trace-level reuse
Performance potential
A first approach
Related work
Conclusions
Trace-Level Reuse
Trace: any dynamic sequence of instructions
Goal: avoid the execution of a trace by reusing its results, provided that the same trace with the same inputs has already been executed
Advantages
– Reduces the utilization of other machine resources
– Reduces the time to compute results
– Allows the processor to exceed the dataflow limit
Trace-Level Reuse
Hardware scheme
Main issues
– Reuse Trace Memory (RTM)
– Dynamic trace collection
– Reuse test
– State update
Reuse Trace Memory (RTM)
The RTM stores candidate traces to be reused
RTM entry fields:
– Initial address
– Trace input
    – Input registers: identifiers & contents
    – Input memory: addresses & contents
– Trace output
    – Output registers: identifiers & contents
    – Output memory: addresses & contents
– Next address
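The entry layout above can be pictured as a small record. The following is only a sketch: the class name `RTMEntry`, the field names, and the use of dictionaries are illustrative assumptions, not the hardware organization described in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RTMEntry:
    """One candidate trace, mirroring the fields listed on the slide (names are hypothetical)."""
    initial_address: int                                        # PC of the first instruction of the trace
    input_regs: Dict[str, int] = field(default_factory=dict)    # register id -> value read by the trace
    input_mem: Dict[int, int] = field(default_factory=dict)     # address -> value read by the trace
    output_regs: Dict[str, int] = field(default_factory=dict)   # register id -> value written by the trace
    output_mem: Dict[int, int] = field(default_factory=dict)    # address -> value written by the trace
    next_address: int = 0                                       # PC at which execution continues after the trace


# Hypothetical trace starting at 0x1000: reads r1 and mem[0x2000], writes r3.
entry = RTMEntry(
    initial_address=0x1000,
    input_regs={"r1": 5},
    input_mem={0x2000: 7},
    output_regs={"r3": 12},
    next_address=0x1010,
)
```

Keeping identifiers together with contents is what lets a later reuse test compare the stored inputs against the current machine state.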
Dynamic trace collection
Chooses candidate traces
– Initial address
– Next address
Input and output trace locations are computed at execution time and stored along with their values in the RTM
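As a rough illustration of computing input and output locations at execution time, the sketch below walks a list of executed instructions and separates locations read before being written inside the trace (inputs) from locations written (outputs). The function `collect_trace` and the `(reads, writes)` representation are assumptions made for this example.

```python
def collect_trace(executed):
    """executed: list of (reads, writes), one pair per dynamic instruction, where
    reads and writes map a location name to the value observed or produced.
    Returns (inputs, outputs) of the trace."""
    inputs, outputs = {}, {}
    for reads, writes in executed:
        for loc, value in reads.items():
            # A location is a trace input only if it is read before
            # the trace itself has written it.
            if loc not in outputs and loc not in inputs:
                inputs[loc] = value
        for loc, value in writes.items():
            outputs[loc] = value  # the last write to a location wins
    return inputs, outputs


# Hypothetical three-instruction trace:
#   r3 = r1 + r2 ; r4 = r3 * 2 ; mem[0x20] = r4
executed = [
    ({"r1": 1, "r2": 2}, {"r3": 3}),
    ({"r3": 3}, {"r4": 6}),
    ({"r4": 6}, {"mem[0x20]": 6}),
]
inputs, outputs = collect_trace(executed)
```

Here `r3` and `r4` are produced inside the trace, so only `r1` and `r2` count as inputs.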
Reuse Test & State Update
Reuse test
– Performed at some points of the execution
– Checks whether a trace input stored in the RTM matches the current execution state
State update
– Writes the output trace values to the output trace locations
Reuse latency: reuse test plus state update
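The two steps can be sketched in software as a comparison followed by a bulk write. This is a minimal model under stated assumptions: the machine state is a flat dictionary, and the names `reuse_test` and `try_reuse` and the entry layout are hypothetical.

```python
def reuse_test(trace_inputs, state):
    """True if every stored input location currently holds the stored value."""
    return all(state.get(loc) == value for loc, value in trace_inputs.items())


def try_reuse(entry, state, pc):
    """Return the next PC: the trace's next address on a reuse hit, otherwise pc unchanged."""
    if entry["initial_address"] == pc and reuse_test(entry["inputs"], state):
        state.update(entry["outputs"])  # state update: write the stored outputs
        return entry["next_address"]
    return pc


# Hypothetical entry and machine state.
entry = {
    "initial_address": 0x1000,
    "inputs": {"r1": 1, "r2": 2},
    "outputs": {"r3": 3},
    "next_address": 0x100C,
}
state = {"r1": 1, "r2": 2, "r3": 0}

hit_pc = try_reuse(entry, state, 0x1000)    # inputs match: outputs are written, trace is skipped
r3_after_hit = state["r3"]

state["r1"] = 9                             # an input changes
miss_pc = try_reuse(entry, state, 0x1000)   # inputs differ: trace must be executed normally
```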
Outline
Trace-level reuse
Performance Potential
A first approach
Related work
Conclusions
Performance Potential
Baseline machine
– ISA: Alpha
– Only constrained by:
    – Data dependences
    – Data dependences + finite instruction window
Reuse engine
– Perfect trace reuse
    – Maximum-length traces
    – Minimum number of traces
Performance Potential
Instruction-level reuse (ILR)
– Perfect instruction reuse engine: all previously executed instances of each instruction are checked for a possible reuse
– Maximum reusability: almost 90%
[Figure: bar chart of reusability, axis from 0 to 100]
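A perfect instruction reuse engine of this kind can be approximated in a trace-driven study by remembering every input combination seen so far for each static instruction. The sketch below assumes a stream of `(pc, input_values)` pairs; the function name `reusability` and the toy stream are illustrative only and do not reproduce the measured results.

```python
from collections import defaultdict


def reusability(stream):
    """stream: iterable of (pc, input_values), one item per dynamic instruction.
    An instance counts as reusable if the same static instruction (same pc)
    was executed earlier with the same input values."""
    seen = defaultdict(set)   # pc -> input tuples seen so far (unbounded history)
    reused = total = 0
    for pc, inputs in stream:
        total += 1
        if inputs in seen[pc]:
            reused += 1
        else:
            seen[pc].add(inputs)
    return reused / total if total else 0.0


# Hypothetical stream: the instruction at pc 4 runs four times,
# twice with inputs (1, 2) and twice with inputs (5, 6).
stream = [(4, (1, 2)), (8, (3,)), (4, (1, 2)), (4, (5, 6)), (4, (5, 6))]
ratio = reusability(stream)
```

In this toy stream two of the five dynamic instructions repeat an earlier instance, so `ratio` is 0.4.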
ILR
Performance limits
– Baseline machine constrained by data dependences
– Reuse engine: 1-cycle latency
[Figure: bar chart of speed-up, axis from 0 to 4]
ILR
Performance limits
– Baseline machine constrained by:
    – Data dependences
    – Data dependences and instruction window
– Reuse latency: 1 to 4 cycles
[Figure: speed-up (axis from 1 to 1.5) versus reuse latency (1 to 4 cycles), for an infinite instruction window (IW) and a 256-entry IW]
ILR
Performance limits
– Moderate potential with a perfect reuse engine
– Instruction latency is reduced
– The reuse of a chain of dependent instructions is still a sequential process: source operands must be ready
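A back-of-the-envelope model shows why a dependent chain limits instruction-level reuse. The functions and the numbers below are hypothetical and ignore every other machine constraint; they only contrast serialized per-instruction reuse with reusing the whole chain at once.

```python
def chain_cycles_executed(n, exec_latency):
    """n dependent instructions executed normally: strictly one after another."""
    return n * exec_latency


def chain_cycles_ilr(n, reuse_latency):
    """Same chain with every instruction reused individually. Each reuse still
    waits for its source operand, so the reuses are serialized as well."""
    return n * reuse_latency


def chain_cycles_tlr(trace_reuse_latency):
    """Whole chain reused as one trace: a single reuse operation."""
    return trace_reuse_latency


# Hypothetical numbers: 6 dependent instructions, 3-cycle execution latency,
# 1-cycle reuse latency for both an instruction and a trace.
executed = chain_cycles_executed(6, 3)
ilr = chain_cycles_ilr(6, 1)
tlr = chain_cycles_tlr(1)
```

Under these assumptions the chain takes 18 cycles when executed, 6 with instruction-level reuse, and 1 when reused as a single trace.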
Performance Potential
Trace-level reuse (TLR)
Perfect reuse engine: traces consist of maximum-length dynamic sequences of reusable instructions
– Upper bound on the reusability
– Lower bound on the number of traces
[Figure: instructions I1 to I6 grouped into a single TRACE]
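Given a per-instruction reusability flag (for example from a perfect instruction-level engine), maximum-length traces are simply the maximal runs of consecutive reusable instructions. The helper `maximal_traces` below is a sketch of that grouping, with made-up input data.

```python
def maximal_traces(reusable):
    """reusable: list of booleans, one per dynamic instruction in program order.
    Returns (start, length) for each maximal run of reusable instructions."""
    traces, start = [], None
    for i, flag in enumerate(reusable):
        if flag and start is None:
            start = i                              # a new trace begins here
        elif not flag and start is not None:
            traces.append((start, i - start))      # a non-reusable instruction ends the trace
            start = None
    if start is not None:                          # the stream ended inside a trace
        traces.append((start, len(reusable) - start))
    return traces


# Hypothetical flags for nine dynamic instructions.
flags = [True, True, True, False, True, True, False, False, True]
traces = maximal_traces(flags)
average_size = sum(length for _, length in traces) / len(traces)
```

Because no run can be extended, this grouping yields the fewest traces that cover all reusable instructions.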
TLR
Average trace size: 15.0 instructions (FP: 11.7, INT: 20.3)
[Figure: bar chart of trace size on a logarithmic axis from 1 to 100; two bars are labeled 203 and 116]
TLR
Performance limits
– Baseline machine constrained by data dependences and instruction window (256-entry)
– Reuse engine latency
    – Constant
    – Linear: f(#INPUTS + #OUTPUTS)
[Figure: speed-up (axis from 0 to 4) for constant reuse latencies of 1, 2, 3 and 4 cycles and linear reuse latencies of (I+O)/32, (I+O)/16, (I+O)/8, (I+O)/4, (I+O)/2 and I+O]
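The two latency models can be written down as a small function. This is a sketch only: rounding up to whole cycles and the one-cycle minimum are assumptions, since the slides give just the divisors, and the parameter names are hypothetical.

```python
import math


def reuse_latency(num_inputs, num_outputs, model="constant", cycles=1, per_cycle=1):
    """Reuse latency in cycles.
    constant: a fixed number of cycles regardless of trace size.
    linear:   (I + O) / per_cycle, rounded up, at least one cycle (both assumptions)."""
    if model == "constant":
        return cycles
    if model == "linear":
        return max(1, math.ceil((num_inputs + num_outputs) / per_cycle))
    raise ValueError(f"unknown model: {model}")


# Hypothetical trace with 6 input values and 4 output values.
constant_2 = reuse_latency(6, 4, model="constant", cycles=2)
linear_div4 = reuse_latency(6, 4, model="linear", per_cycle=4)    # (I+O)/4
linear_div32 = reuse_latency(6, 4, model="linear", per_cycle=32)  # (I+O)/32
linear_full = reuse_latency(6, 4, model="linear", per_cycle=1)    # I+O
```

The divisor can be read as the number of values the reuse engine compares or writes per cycle.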
Outline
Trace-level reuse
Performance potential
A first approach
Related work
Conclusions
A First Approach
Reuse Trace Memory (RTM)
– Indexed by trace initial address (4-way and 8-way)
– Maximum number of input and output values:
    – 8 register values
    – 4 memory values
– Sizes
    – 512 entries (4 different entries per initial address)
    – 4K entries (8 entries per initial address)
    – 32K entries (16 entries per initial address)
    – 256K entries (16 entries per initial address)
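A software model of such an RTM might look as follows. Several points are assumptions made for the sketch: the value limits are applied separately to inputs and to outputs, replacement within an initial address is LRU, and a dictionary keyed by the full initial address stands in for the hardware indexing. The class and method names are hypothetical.

```python
from collections import OrderedDict, defaultdict

MAX_REG_VALUES = 8   # from the slide: at most 8 register values
MAX_MEM_VALUES = 4   # from the slide: at most 4 memory values


class RTM:
    def __init__(self, entries_per_address=4):
        self.entries_per_address = entries_per_address
        # initial address -> {frozen inputs: outputs}, least recently used first
        self.sets = defaultdict(OrderedDict)

    def insert(self, initial_address, reg_in, mem_in, reg_out, mem_out):
        """Store a trace. Returns False if it exceeds the value limits
        (assumed here to apply separately to inputs and to outputs)."""
        if max(len(reg_in), len(reg_out)) > MAX_REG_VALUES:
            return False
        if max(len(mem_in), len(mem_out)) > MAX_MEM_VALUES:
            return False
        ways = self.sets[initial_address]
        key = (frozenset(reg_in.items()), frozenset(mem_in.items()))
        ways[key] = (dict(reg_out), dict(mem_out))
        ways.move_to_end(key)
        if len(ways) > self.entries_per_address:
            ways.popitem(last=False)   # evict the least recently used trace (LRU is an assumption)
        return True

    def candidates(self, initial_address):
        """Number of stored traces that start at this address."""
        return len(self.sets.get(initial_address, ()))


rtm = RTM(entries_per_address=4)
for v in range(6):                       # six traces with different inputs, same initial address
    rtm.insert(0x1000, {"r1": v}, {}, {"r2": v + 1}, {})
stored = rtm.candidates(0x1000)          # only the four most recent remain

# A trace with nine register inputs exceeds the limit and is rejected.
too_big = rtm.insert(0x2000, {f"r{i}": i for i in range(9)}, {}, {}, {})
```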
A First Approach
In-order execution
Reuse test performed for every fetch operation
[Figure: block diagram showing the PC, the instruction cache, the RTM, an RTM entry, the reuse test, and the Fetch, Decode, Execute and Commit pipeline stages]
A First Approach
Dynamic trace collection
– Built traces have all instructions reusable: an additional memory to check instruction reusability is needed
– Fixed-length traces starting at any address
– Trace expansion on reuse hit
Reusable Instructions
25% reusability for a 4K-entry RTM
[Figure: reusable instructions (axis from 0 to 70) for the schemes ILR, ILR-E and I1-E to I8-E, with RTM sizes of 512, 4K, 32K and 256K entries]
Trace Size
6 instructions for a 4K-entry RTM
[Figure: trace size (axis from 0 to 8) for the schemes ILR, ILR-E and I1-E to I8-E, with RTM sizes of 512, 4K, 32K and 256K entries]
Related work
Data reuse
– Software implementation
    – Memoization [Richardson, 92]
– Hardware implementation
    – Tree Machine [Harbison, 82]
    – At instruction level: Reuse Buffer [Sodani and Sohi, 97]; Register renaming [Jourdan et al., 98]; Redundant Computation Buffer [Molina, González and Tubella, 99]
    – At “trace” level: Result cache [Richardson, 93] [Oberman and Flynn, 95]; Basic block reuse [Huang and Lilja, 99]
Conclusions
Increasing the granularity of reuse from instructions to traces
– Less reusability
– More effective
    – Fetch bandwidth is reduced
    – Effective instruction window size is increased
    – Number of operations per reused instruction is reduced
    – DATA DEPENDENCES ARE BROKEN
Conclusions
Concentrate effort on
– Devising strategies to choose reusable traces
    – High-level structures
    – Compiler assistance
– Reducing the reuse test overhead
    – Boolean test
    – Invalidate/validate RTM entries
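One possible reading of the Boolean test with invalidation is sketched below: each RTM entry carries a valid bit that is cleared whenever one of its input locations is written, so the reuse test no longer compares values. This is an interpretation of the bullet points, not a scheme described in the talk; the class `ValidBitRTM` and its methods are hypothetical, and revalidation is left out.

```python
class ValidBitRTM:
    """Each entry carries a valid bit; a write to any of its input locations clears it."""

    def __init__(self):
        # initial address -> {"inputs": set of locations, "outputs": dict, "valid": bool}
        self.entries = {}

    def store(self, initial_address, input_locations, outputs):
        self.entries[initial_address] = {
            "inputs": set(input_locations),
            "outputs": dict(outputs),
            "valid": True,
        }

    def on_write(self, location):
        """Called for every register or memory write performed by the program."""
        for entry in self.entries.values():
            if location in entry["inputs"]:
                entry["valid"] = False   # conservative: invalidates even if the value is unchanged

    def reuse_test(self, initial_address):
        """Boolean test: no value comparison, only the valid bit."""
        entry = self.entries.get(initial_address)
        return entry is not None and entry["valid"]


rtm = ValidBitRTM()
rtm.store(0x1000, {"r1", "r2"}, {"r3": 3})
before = rtm.reuse_test(0x1000)   # entry is valid right after collection
rtm.on_write("r1")                # one of the trace inputs is overwritten
after = rtm.reuse_test(0x1000)    # entry can no longer be reused
```

The trade-off in this sketch is a cheaper test in exchange for lost reuse opportunities when an input is rewritten with the same value.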