Trace-Level Reuse


A. González, J. Tubella and C. Molina

Dpt. d'Arquitectura de Computadors

Universitat Politècnica de Catalunya

1999 International Conference on Parallel Processing (ICPP'99)

September 21, 1999

Motivation

Increase performance by overcoming the dataflow limit

DATA SPECULATION: exploits the predictability of values

DATA REUSE: exploits the redundancy of computations



Redundant computations are rather frequent
– Code: loops, recursive subroutines
– Data: finite domain of values

The results could be reused instead of recomputed

[Figure: a dynamic execution stream in which OUT = f(IN) appears three times; the repeated instances are marked as redundant computations]



Reuse granularity
– An instruction
– A sequence of instructions: TRACE-LEVEL REUSE

Performance potential of data reuse
– At instruction level
– At trace level


Outline

Trace-level reuse

Performance potential

A first approach

Related work

Conclusions


Trace-Level Reuse

Trace: any dynamic sequence of instructions

Goal: avoid the execution of a trace by reusing its results, provided that the same trace with the same inputs has already been executed

Advantages
– Reduces the utilization of other machine resources
– Reduces the time to compute results
– Allows the processor to exceed the dataflow limit


Trace-Level Reuse

Hardware scheme

Main issues
– Reuse Trace Memory (RTM)
– Dynamic trace collection
– Reuse test
– State update


Reuse Trace Memory (RTM)

RTM stores candidate traces to be reused

RTM entry fields
– Initial address
– Trace input: input registers (identifiers and contents), input memory (addresses and contents)
– Trace output: output registers (identifiers and contents), output memory (addresses and contents)
– Next address

[Figure: a TRACE with its INPUT and OUTPUT]
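As a rough illustration, the entry layout above can be modeled as a record. This is only a sketch: the class name `RTMEntry`, its field names, and the use of Python dictionaries are assumptions made for illustration and say nothing about the actual hardware organization.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RTMEntry:
    """Hypothetical software model of one RTM entry (all names are illustrative)."""
    initial_address: int   # address of the first instruction of the trace
    next_address: int      # address at which execution continues after the trace
    input_regs: Dict[int, int] = field(default_factory=dict)   # register id -> contents
    input_mem: Dict[int, int] = field(default_factory=dict)    # memory address -> contents
    output_regs: Dict[int, int] = field(default_factory=dict)  # register id -> contents
    output_mem: Dict[int, int] = field(default_factory=dict)   # memory address -> contents

    def num_inputs(self) -> int:
        """Number of input locations (registers plus memory)."""
        return len(self.input_regs) + len(self.input_mem)

    def num_outputs(self) -> int:
        """Number of output locations (registers plus memory)."""
        return len(self.output_regs) + len(self.output_mem)


entry = RTMEntry(
    initial_address=0x1000,
    next_address=0x1010,
    input_regs={1: 5, 2: 7},
    input_mem={0x100: 4},
    output_regs={3: 12, 1: 16},
)
print(entry.num_inputs(), entry.num_outputs())  # prints: 3 2
```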


Dynamic trace collection

Chooses candidate traces
– Initial address
– Next address

Input and output trace locations are computed at execution time and stored along with their values in the RTM
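One way to picture how input and output locations could be computed at execution time is shown below. It is a sketch under assumptions: a location counts as a trace input if it is read before the trace writes it, and as a trace output if the trace writes it. The class `TraceCollector`, the `record` method, and the `("r", n)` / `("m", addr)` location encoding are hypothetical names chosen for the example.

```python
class TraceCollector:
    """Hypothetical collector that accumulates the inputs and outputs of a trace."""

    def __init__(self):
        self.inputs = {}   # location -> value seen when the trace first read it
        self.outputs = {}  # location -> last value written by the trace

    def record(self, reads, writes):
        """Record one executed instruction.

        reads / writes map a location, e.g. ("r", 3) or ("m", 0x100),
        to the value read from it or written to it.
        """
        for loc, value in reads.items():
            # A location produced earlier in the trace is not a trace input.
            if loc not in self.outputs and loc not in self.inputs:
                self.inputs[loc] = value
        for loc, value in writes.items():
            self.outputs[loc] = value


collector = TraceCollector()
# r3 = r1 + r2
collector.record({("r", 1): 5, ("r", 2): 7}, {("r", 3): 12})
# r1 = r3 + mem[0x100]; r3 was produced inside the trace, so it is not an input
collector.record({("r", 3): 12, ("m", 0x100): 4}, {("r", 1): 16})
```

After these two instructions the inputs are r1, r2 and mem[0x100], and the outputs are r3 and r1.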


Reuse Test & State Update

Reuse test
– Performed at some points of the execution
– Checks whether a trace input stored in the RTM matches the current execution state

State update
– Writes the output trace values to the output trace locations

REUSE LATENCY: reuse test plus state update
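The two steps can be sketched in software as follows. This is a minimal model, assuming that the architectural state is a pair of dictionaries (registers and memory) and that a stored trace carries the fields listed for an RTM entry. The names `StoredTrace`, `reuse_test` and `state_update` are illustrative, not taken from the slides.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StoredTrace:
    """Hypothetical stored trace (field names are illustrative)."""
    initial_address: int
    next_address: int
    input_regs: Dict[int, int]
    input_mem: Dict[int, int]
    output_regs: Dict[int, int]
    output_mem: Dict[int, int]


def reuse_test(trace, pc, regs, mem):
    """Return True if the stored trace input matches the current state."""
    if pc != trace.initial_address:
        return False
    regs_match = all(regs.get(r) == v for r, v in trace.input_regs.items())
    mem_match = all(mem.get(a) == v for a, v in trace.input_mem.items())
    return regs_match and mem_match


def state_update(trace, regs, mem):
    """Write the trace outputs to their locations and return the next PC."""
    regs.update(trace.output_regs)
    mem.update(trace.output_mem)
    return trace.next_address


trace = StoredTrace(0x1000, 0x1010, {1: 5, 2: 7}, {0x100: 4}, {3: 12, 1: 16}, {})
regs, mem, pc = {1: 5, 2: 7, 3: 0}, {0x100: 4}, 0x1000
hit = reuse_test(trace, pc, regs, mem)
if hit:
    pc = state_update(trace, regs, mem)  # the trace is skipped, not executed
```

Note that after the update r1 holds 16, so the same entry would no longer match if execution returned to address 0x1000 without r1 being restored to 5.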


Outline

Trace-level reuse

Performance Potential

A first approach

Related work

Conclusions


Performance Potential

Base-line machine
– ISA: Alpha
– Only constrained by:
  – Data dependences
  – Data dependences + finite instruction window

Reuse engine
– Perfect trace reuse
  – Maximum-length traces
  – Minimum number of traces


Performance Potential

Instruction-level reuse (ILR)
– Perfect instruction reuse engine: all previously executed instances of each instruction are checked for a possible reuse
– Maximum reusability: almost 90%

[Figure: reusability (%), axis from 0 to 100]
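A perfect instruction reuse engine of this kind can be approximated offline with an unbounded history. The sketch below assumes that a dynamic instruction is described by its address and the values of its inputs, and that an instance is reusable when the same pair has been seen before. The function name `reusability` and the stream format are assumptions made for the example.

```python
def reusability(stream):
    """Fraction of dynamic instructions whose (pc, inputs) pair was seen before.

    stream yields (pc, inputs) pairs, where inputs is a sequence of the
    input values of that dynamic instruction. The history is unbounded,
    which models a perfect engine rather than a realizable one.
    """
    seen = set()
    total = 0
    reused = 0
    for pc, inputs in stream:
        total += 1
        key = (pc, tuple(inputs))
        if key in seen:
            reused += 1
        else:
            seen.add(key)
    return reused / total if total else 0.0


stream = [
    (0x0, (1, 2)),
    (0x4, (3,)),
    (0x0, (1, 2)),  # same instruction, same inputs: reusable
    (0x4, (5,)),    # same instruction, different input: not reusable
    (0x0, (1, 2)),  # reusable again
]
print(reusability(stream))  # prints: 0.4
```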


ILR

Performance limits
– Base-line machine constrained by data dependences
– Reuse engine: 1-cycle latency

[Figure: speed-up, axis from 0 to 4]


ILR

Performance limits
– Base-line machine constrained by data dependences, or by data dependences and instruction window
– Reuse latency: 1 to 4 cycles

[Figure: speed-up (axis from 1 to 1.5) versus reuse latency (1, 2, 3 and 4 cycles), for an infinite IW and a 256-entry IW]


ILR

Performance limits
– Moderate potential with a perfect reuse engine
– Instruction latency is reduced
– The reuse of a chain of dependent instructions is still a sequential process: source operands must be ready


Performance Potential

Trace-level reuse (TLR)
– Perfect reuse engine
– Traces consist of maximum-length dynamic sequences of reusable instructions
  – Upper bound of the maximum reusability
  – Lower bound of the minimum number of traces

[Figure: dynamic instruction sequence I1, I2, I3, I4, I5, I6 with a TRACE marked]
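Forming maximum-length traces from a stream of per-instruction reusability flags amounts to finding maximal runs of consecutive reusable instructions. The following sketch assumes the flags are already known (for example from a perfect instruction reuse engine); the function name `maximal_traces` is illustrative.

```python
from itertools import groupby


def maximal_traces(reusable_flags):
    """Return the lengths of the maximal runs of consecutive reusable instructions."""
    return [
        sum(1 for _ in run)
        for is_reusable, run in groupby(reusable_flags)
        if is_reusable
    ]


flags = [True, True, False, True, True, True, False, False, True]
sizes = maximal_traces(flags)
average_size = sum(sizes) / len(sizes)
print(sizes, average_size)  # prints: [2, 3, 1] 2.0
```

Because every run is as long as possible, the number of traces is the smallest that covers all reusable instructions.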


TLR

Average trace size: 15.0 instructions
– FP: 11.7
– INT: 20.3

[Figure: trace size, axis ticks at 1, 10 and 100; the values 203 and 116 are annotated]


TLR

Performance limits
– Base-line machine constrained by data dependences and instruction window (256-entry)
– Reuse engine latency
  – Constant
  – Linear: f(#INPUTS + #OUTPUTS)

[Figure: speed-up (axis from 0 to 4) versus reuse latency. CONSTANT: 1, 2, 3 and 4 cycles. LINEAR: (I+O)/32, (I+O)/16, (I+O)/8, (I+O)/4, (I+O)/2 and I+O]
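The two latency models can be written down as small functions. This is a sketch, not the authors' simulator: rounding up to a whole number of cycles and the one-cycle minimum are assumptions added here, since the slides only give the expressions (I+O)/k. The names `constant_latency` and `linear_latency` are illustrative.

```python
import math


def constant_latency(cycles):
    """Reuse latency that does not depend on the trace."""
    def latency(num_inputs, num_outputs):
        return cycles
    return latency


def linear_latency(divisor):
    """Reuse latency proportional to the number of inputs plus outputs.

    Assumption: the result is rounded up and is never below one cycle.
    """
    def latency(num_inputs, num_outputs):
        return max(1, math.ceil((num_inputs + num_outputs) / divisor))
    return latency


# A trace with 6 inputs and 5 outputs under a few of the plotted models.
print(constant_latency(3)(6, 5))  # prints: 3
print(linear_latency(8)(6, 5))    # prints: 2
print(linear_latency(1)(6, 5))    # prints: 11
```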


Outline

Trace-level reuse

Performance potential

A first approach

Related work

Conclusions


A First Approach

Reuse Trace Memory (RTM)
– Indexed by trace initial address (4-way and 8-way)
– Maximum number of input and output values: 8 register values, 4 memory values
– Sizes
  – 512 entries (4 different entries per initial address)
  – 4K entries (8 entries per initial address)
  – 32K entries (16 entries per initial address)
  – 256K entries (16 entries per initial address)
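A table indexed by initial address that keeps several entries per address can be modeled roughly as below. The sketch makes several assumptions that the slides do not state: the set is selected from the word address modulo the number of sets, replacement is LRU, and an entry is identified by its initial address together with its input values. The class `SetAssociativeRTM` and its methods are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeRTM:
    """Hypothetical set-associative model of the RTM (LRU replacement assumed)."""

    def __init__(self, num_entries, ways):
        assert num_entries % ways == 0
        self.ways = ways
        self.num_sets = num_entries // ways
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def _set_for(self, initial_address):
        # Alpha instructions are 4 bytes long, so drop the two low address bits.
        return self.sets[(initial_address >> 2) % self.num_sets]

    def insert(self, initial_address, inputs, outputs):
        """Store a trace; inputs must be hashable, e.g. a tuple of (location, value)."""
        entries = self._set_for(initial_address)
        key = (initial_address, inputs)
        if key in entries:
            entries.move_to_end(key)       # refresh its LRU position
        elif len(entries) >= self.ways:
            entries.popitem(last=False)    # evict the least recently inserted entry
        entries[key] = outputs

    def candidates(self, initial_address):
        """Return (inputs, outputs) for every stored trace starting at this address."""
        entries = self._set_for(initial_address)
        return [(k[1], v) for k, v in entries.items() if k[0] == initial_address]


rtm = SetAssociativeRTM(num_entries=8, ways=2)
rtm.insert(0x1000, (("r1", 5),), "A")
rtm.insert(0x1000, (("r1", 6),), "B")
rtm.insert(0x1000, (("r1", 7),), "C")  # the set is full, so trace "A" is evicted
```

Each candidate returned for the current address would then go through the reuse test against the current register and memory contents.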



In-order execution

Reuse test performed for every fetch operation

[Figure: pipeline diagram with PC, Instruction Cache, RTM, RTM entry, Reuse Test, and the Fetch, Decode, Execute and Commit stages]



Dynamic trace collection
– Built traces have all instructions reusable: an additional memory to check instruction reusability is needed
– Fixed-length traces starting at any address
– Trace expansion on reuse hit


Reusable Instructions

25% reusability for a 4K-entry RTM

[Figure: reusable instructions (%), axis from 0 to 70, for the configurations ILR, ILR-E, I1-E, I2-E, I3-E, I4-E, I5-E, I6-E, I7-E and I8-E, with RTM sizes of 512, 4K, 32K and 256K entries]


Trace Size

6 instructions for a 4K-entry RTM

[Figure: trace size, axis from 0 to 8, for the configurations ILR, ILR-E, I1-E, I2-E, I3-E, I4-E, I5-E, I6-E, I7-E and I8-E, with RTM sizes of 512, 4K, 32K and 256K entries]


Related work

Data reuse
– Software implementation: Memoization [Richardson, 92]
– Hardware implementation: Tree Machine [Harbison, 82]

At instruction level
– Reuse Buffer [Sodani and Sohi, 97]
– Register renaming [Jourdan et al., 98]
– Redundant Computation Buffer [Molina, González and Tubella, 99]

At "trace" level
– Result cache [Richardson, 93] [Oberman and Flynn, 95]
– Basic block reuse [Huang and Lilja, 99]


Conclusions

Increasing the granularity of reuse from instructions to traces
– Less reusability
– More effective
  – Fetch bandwidth is reduced
  – Effective instruction window size is increased
  – Number of operations per reused instruction is reduced
  – DATA DEPENDENCES ARE BROKEN


Conclusions

Concentrate effort on
– Devising strategies to choose reusable traces
  – High-level structures
  – Compiler assistance
– Reducing the reuse test overhead
  – Boolean test
  – Invalidate/validate RTM entries
