Reconfigurable Microprocessors
Lih Wen Koh
05s1 COMP4211 presentation
18 May 2005
Presentation Overview
Current Research Direction
Related Work
Experiments
What Next?
Current Research Direction
[Figure: Hardware Components of MIPS R10000 [Yeager96]. Pipeline stages shown: Fetch, Decode, Issue, Execute, Write. Components shown:
Instruction Cache, Instruction Predecode, Branch History Table, Instruction TLB
Instruction Decode, Active List (32 entries), Free Register Lists (1 for Int, 1 for FP), Register Map Tables (1 for Int, 1 for FP)
Integer Queue (16 entries), FP Queue (16 entries), Mem Queue (16 entries)
Integer Registers / Bypass (64 x 64 bits), FP Registers / Bypass (64 x 64 bits)
Address +, ALU1, ALU2, FP +, FP *, ÷, √
Data TLB, Data Cache]
Wide superscalar, out-of-order execution processor core
Exploits ILP
But true data dependencies are inherent in application programs
MIPS R10k, NetBurst, AMD etc. use a bypass network to forward just-computed results, allowing back-to-back issue of dependent instructions
Complexity of the bypass network grows quadratically w.r.t. issue width
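The quadratic growth can be illustrated with a minimal sketch. It assumes a simple model that is not taken from the slides: one functional unit per issue slot, and every result forwardable to either of two source operands of every functional unit. The function name and its parameters are illustrative only.

```python
def bypass_paths(issue_width: int, operands_per_unit: int = 2, bypass_stages: int = 1) -> int:
    """Count forwarding paths in a fully connected bypass network (illustrative model)."""
    producers = issue_width * bypass_stages      # result buses that can be forwarded
    consumers = issue_width * operands_per_unit  # operand inputs that can receive them
    return producers * consumers


for width in (2, 4, 8):
    print(width, bypass_paths(width))
```

Under this model, doubling the issue width quadruples the number of forwarding paths.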
Current Research Direction
Observation 1: Multi-cycle broadcast
Wire delays – accounted for in Intel NetBurst
Allows higher processor clock frequency at the cost of reduced IPC
Observation 2: FP execution unit is idle most of the time, even in FP-intensive applications (5-10%)
[Sassone04]
[Figure: Proportion of Functional Unit Type Requested, MediaBench Applications. Y-axis: 0% to 100%. Series: Rd/Wr Ports, FP_MULT/DIV, FP_ALU, Int_MULT/DIV, Int_ALU.]
[Figure: Proportion of Functional Unit Type Requested, SPEC2000 Applications (164.gzip.graphic, 164.gzip.log, 164.gzip.program, 164.gzip.random, 164.gzip.source, 176.gcc, 181.mcf, 197.parser, 168.wupwise, 171.swim, 173.applu, 179.art, 183.equake, 188.ammp, 301.apsi). Y-axis: 0% to 100%. Series: Rd/Wr Ports, FP_MULT/DIV, FP_ALU, Int_MULT/DIV, Int_ALU.]
Literature Survey
[Epalza04]: Dynamic Reallocation of Functional Units in Superscalar Processors
Switch the execution mode of idle floating-point units to four additional integer ALUs
Addition of bypass networks adds 1 cycle of latency to the modified FPU.
19% speedup for SPECint2000
3.5% speedup for SPECfp2000
Issues:
Need to improve control for mode switching
Plans
Other patterns:
Pattern with nodes x, y1, y2, y3
Possible 2nd input for node y1/y2/y3: 1. from register file 2. from node x
Pattern with nodes x1, x2, y, z
Possible 2nd input for node y: 1. from register file 2. from node x1 3. from node x2
Possible 2nd input for node z: 1. from register file 2. from node x1 3. from node x2 4. from node y
Pattern with nodes x, y1, y2, z
Possible 2nd input for node y1: 1. from register file 2. from node x
Possible 2nd input for node y2: 1. from register file 2. from node x
Possible 2nd input for node z: 1. from register file 2. from node x 3. from node y1 4. from node y2
Pattern with nodes x, y, z1, z2
Possible 2nd input for node y: 1. from register file 2. from node x
Possible 2nd input for node z1: 1. from register file 2. from node x 3. from node y
Possible 2nd input for node z2: 1. from register file 2. from node x 3. from node y
Pattern with nodes x1, x2, y1, y2
Possible 2nd input for node y1: 1. from register file 2. from node x1 3. from node x2
Possible 2nd input for node y2: 1. from register file 2. from node x1 3. from node x2
Possible 1st input for node y2: 1. from node x1 2. from node x2
Pattern with nodes x1, x2, x3, y
Possible inputs: 1. from register file 2. from node x1 3. from node x2 4. from node x3
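The 2nd-input lists above appear to follow one rule: a node may take its second operand from the register file or from any node at an earlier dependence level of the pattern. The sketch below encodes that inferred rule. The rule itself, the level-list representation and the function name are assumptions for illustration, not something stated on the slide.

```python
def second_input_sources(levels):
    """Map each non-root node to its candidate second-operand sources.

    `levels` lists the pattern's nodes by dependence depth, producers first.
    """
    sources = {}
    earlier = []  # nodes at shallower levels seen so far
    for depth, nodes in enumerate(levels):
        if depth > 0:
            for node in nodes:
                sources[node] = ["register file"] + list(earlier)
        earlier.extend(nodes)
    return sources


# Pattern with nodes x1, x2 feeding y, which feeds z.
pattern = [["x1", "x2"], ["y"], ["z"]]
for node, candidates in second_input_sources(pattern).items():
    print(node, candidates)
```

Enumerating the candidates this way gives the number of mux inputs each chained node would need for a given pattern.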
Related Work
[Palacharla97] Dependence-based (FIFO queues + clustered execution units)
Related Work
Extension to rePLay framework [Yehia04]
Experiment: Chaining pairs of dependent instructions
[Figure: [Intel01] Double-speed ALUs]
[Figure: [Vassiliadis96] 3-1 Interlock Collapsing ALUs. Operands come from the Register File. Normal Integer ALU: Carry Lookahead Adder, Logic Operations, mux (stage labels: 4 stages, 1 stage); produces the result of the first instruction in the dependent sequence. 3-1 Interlock Collapsing ALU: Carry-Save Adder, Logic Operations, Control, Carry-Lookahead Adder + Logic Operations (stage labels: 4 stages, 1 stage); produces the result of the second instruction in the dependent sequence.]
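The arithmetic idea behind collapsing a dependent pair of additions can be sketched as follows: a carry-save adder reduces three operands to a sum word and a carry word without carry propagation, and one carry-propagating add finishes the job. This is a minimal sketch of the addition case only. It assumes a 32-bit datapath, omits the logic-operation paths, and is not a model of the circuit in [Vassiliadis96]. The function names are illustrative.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath (assumed width)


def carry_save_add(a, b, c):
    """Reduce three operands to a (sum, carry) pair without propagating carries."""
    partial_sum = (a ^ b ^ c) & MASK
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK
    return partial_sum, carry


def collapsed_add(a, b, c):
    """Compute (a + b + c) mod 2**32: one CSA step, then one carry-propagating add."""
    partial_sum, carry = carry_save_add(a, b, c)
    return (partial_sum + carry) & MASK


# Dependent pair: r1 = a + b; r2 = r1 + c. The second result needs only one
# carry-propagating addition instead of two back-to-back ones.
a, b, c = 7, 9, 100
print(collapsed_add(a, b, c))
```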
Experiment: Chaining pairs of dependent instructions
Modifications to sim-outorder for SimpleScalar PISA.
Simulator stages: ruu_fetch(), ruu_dispatch(), ruu_issue(), ruu_writeback(), ruu_commit()
[Figure: sim-outorder pipeline. Instruction Fetch Queue (IFQ); Register Update Unit (RUU); Load/Store Queue (LSQ); Ready Queue (operands ready, EA ready); functional units: F_MEM, Int ALUs, Int Mult/Div, Rd/Wr Ports, FP Adders, FP Mult/Div, Chained ALU; Event Queue; Instruction WriteBack (Broadcast/Bypass Logic); Branch Misprediction? If so, recover; Instruction Commit.]
Issue rule:
Issue if the requested functional unit is not busy.
If the requested functional unit is an IntALU, and the list of in-flight instructions waiting only on the result of this instruction is non-empty, and the chained ALU is not busy, then schedule this instruction and the first dependent instruction found onto the chained ALU.
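The issue rule can be sketched in a few lines. This is one reading of the rule, in which the chaining check applies only once the requested unit is known to be free. The `Instr` class, the `issue` function and the `busy` map are hypothetical names for illustration and do not correspond to sim-outorder's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    name: str
    fu_class: str  # requested functional-unit class, e.g. "IntALU"
    # In-flight instructions waiting only on this instruction's result.
    sole_waiters: list = field(default_factory=list)


def issue(instr, busy):
    """Return (unit, scheduled instructions), or None if the instruction stalls.

    `busy` maps a functional-unit class name to True when no unit is free.
    """
    if busy.get(instr.fu_class, False):
        return None
    if (instr.fu_class == "IntALU" and instr.sole_waiters
            and not busy.get("ChainedALU", False)):
        # Chain the producer with the first dependent instruction found.
        return "ChainedALU", [instr, instr.sole_waiters[0]]
    return instr.fu_class, [instr]


consumer = Instr("add r3, r1, r4", "IntALU")
producer = Instr("add r1, r2, r5", "IntALU", sole_waiters=[consumer])
unit, scheduled = issue(producer, busy={"IntALU": False, "ChainedALU": False})
print(unit, [i.name for i in scheduled])
```

When the chained ALU is busy, the producer falls back to a normal IntALU and the consumer waits for the result broadcast as usual.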
Experiment: Chaining pairs of dependent instructions
2 CIALUs are sufficient
IPC improvement of ~8%, solely due to savings of broadcast cycles
Reduces utilization of IALUs by ~50%
Reduces queue entries waiting for a result by up to 45%
Up to 25% speedup when broadcast cycles = 4
[Figure: Speedup on IPC for MediaBench Applications (fetch-decode-issue-commit width = 8, ruu:size = 32, #ialu = 8, #cialu = 2). Y-axis: 0% to 25%. Series: broadcast_delay = 1, 2, 3, 4.]
[Figure: Speedup on IPC for MediaBench Applications (fetch-decode-issue-commit width = 8, ruu:size = 32, broadcast delay = 1 cycle). Y-axis: 0% to 25%. X-axis: rawcaudio, rawdaudio, epic, unepic, encode, decode, cjpeg, djpeg, mpeg2encode, mpeg2decode, pegwitenc, pegwitdec. Series: #IntALU = 2, 4, 8 for each of #CIntALU = 1, 2, 3, 4.]
What Next?
Chaining sequence of 3 dependent instructions, other patterns out of the 80.
Architectural impact of adding chained units
complexity of local bypass network etc.
Replace chained units by xALUs converted from the CSA trees in an FP multiply/divide unit
Need to explore the hardware circuits of FP multiply/divide
Develop an adaptive configuration scheme – to best match the interconnections of the swappable xALUs to the patterns of in-flight instructions.
Need to determine the most frequent subset of patterns
References
[Vassiliadis96] High-Performance 3-1 Interlock Collapsing ALUs. James Phillips and Stamatis Vassiliadis.
[Yeager96] The MIPS R10000 Superscalar Microprocessor. Kenneth C. Yeager. IEEE Micro, 1996.
[Palacharla97] Complexity-Effective Superscalar Processors. Subbarao Palacharla, Norman P. Jouppi, J.E. Smith. ISCA, 1997.
[Intel01] The Microarchitecture of the Pentium® 4 Processor. Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, Patrice Roussel. Intel Technology Journal, Q1 2001.
[Epalza04] Dynamic Reallocation of Functional Units in Superscalar Processors. Marc Epalza, Paolo Ienne, Daniel Mlynek. 9th Asia-Pacific Computer Systems Architecture Conference (ACSAC), 2004.
[Yehia04] From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance without ILP or Speculation. Sami Yehia and Olivier Temam.
[Sassone04] Multicycle Broadcast Bypass: Too Readily Overlooked. Peter G. Sassone and D. Scott Wills. Proceedings of the Workshop on Complexity-Effective Design (WCED), May 2004.
Thank You
Overview of Research Topic
Goal of this research:
“investigate the feasibility and potential benefit of effective, automated runtime compilation and execution of software binaries on reconfigurable microprocessors”
Software binaries executing only on superscalar processor
Profile committed instructions to identify critical code regions
Identify and extract suitable instructions from critical code regions for collapsing into complex, atomic instructions
Assembly-to-hardware mapping of collapsed instructions
On the next execution of the transformed critical region, load configuration for the reconfigurable logic
Transfer execution from superscalar processor to the reconfigurable unit
Monitor the execution of the coupled system
Continue normal execution of binary code following the transformed critical code region.
[Figure: Software Binaries running on a Reconfigurable Microprocessor (superscalar processor + reconfigurable logic)]
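The profiling step in the flow above could look roughly like the following sketch, which counts committed program counters per fixed-size address region and reports the regions that dominate execution. The fixed 64-byte regions, the 25% threshold and the function name are assumptions made for illustration. The slides do not specify how critical regions are detected.

```python
from collections import Counter


def critical_regions(committed_pcs, region_bytes=64, min_share=0.25):
    """Return start addresses of regions holding at least `min_share` of all commits."""
    counts = Counter(pc - pc % region_bytes for pc in committed_pcs)
    total = sum(counts.values())
    return sorted(start for start, n in counts.items() if n / total >= min_share)


# Synthetic trace: a 4-instruction loop executed 50 times, plus a little straight-line code.
trace = [0x400000, 0x400004] + [0x400100, 0x400104, 0x400108, 0x40010C] * 50 + [0x400200]
print([hex(r) for r in critical_regions(trace)])
```

Regions reported this way would be the candidates passed on to instruction extraction and collapsing.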
Motivations
Improved execution performance by exploiting parallelism and redundancy in hardware.
Adaptation of hardware resources based on the dynamic behaviour of programs.
Availability of runtime profile allows exploitation of runtime optimizations otherwise difficult to exploit at compile time.
Compilation at the binary level allows execution of legacy software binaries.
Runtime compilation allows transparent migration of software code to hardware.