An Integrated Temporal Partitioning and Mapping Framework for Handling
Custom Instructions on a Reconfigurable Functional Unit
Farhad Mehdipour†, Hamid Noori††, Morteza Saheb Zamani†, Kazuaki Murakami††, Mehdi Sedighi†, Koji Inoue††
†Computer and IT Engineering Department, Amirkabir University of Technology {mehdipur,szamani,msedighi}@aut.ac.ir
††Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University
[email protected], {murakami,inoue}@i.kyushu-u.ac.jp
ACSAC 2006 - Shanghai, China Kyushu University
Agenda
- Introduction
- General overview of the architecture
- Generating Custom Instructions
- Reconfigurable Functional Unit (RFU)
- Tool chain used for our quantitative approach
- Integrated Temporal Partitioning and Mapping
  - The Integrated Framework
  - Incremental Temporal Partitioning Algorithm
  - Mapping Procedure
- Experimental Results
Introduction
Approaches for designing embedded SoCs:
- Application Specific Integrated Circuits (ASICs)
  - Higher performance
  - Lower power consumption
  - Not flexible
  - Expensive and time-consuming design process
- General Purpose Processors (GPPs)
  - Availability of tools
  - Programmability
  - Low performance
  - High power consumption
- Application Specific Instruction-set Processors (ASIPs)
  - More flexible than ASICs
  - Higher performance than GPPs
  - Long and costly design and verification
- Extensible Processors
  - More flexibility
  - Significant non-recurring engineering costs
General overview of the architecture
Adaptive Dynamic Extensible Processor
- Base Processor: N-way in-order general RISC (Reg File; Fetch, Decode, Execute, Memory, Write)
- Augmented Hardware
  - Profiler: detects start addresses of Hot Basic Blocks (HBBs)
  - RFU: executes Custom Instructions
  - Sequencer: switches between main processor and RFU
Operation modes
- Training Mode
  - Profiler: binary-level profiling of the applications, detecting start addresses of HBBs
  - Running tools for generating Custom Instructions, generating configuration data for the RFU and initializing the Sequencer Table
  - Binary rewriting
- Normal Mode
  - Sequencer: monitors the PC and switches between main processor and RFU
  - RFU: executes CIs
[Figure: Applications, Processor, Profiler, RFU and Sequencer in Training Mode and Normal Mode]
Integrating base processor with other components
[Figure: the GPP (Register File, ID/EXE Reg, Functional Unit, MUX, EXE/MEM Reg) integrated with the augmented HW (RFU with Configuration Memory, Sequencer with Sequencer Table, Profiler with Profiler Table); online training]
Generation of Custom Instructions
- Custom instructions
  - Limited to one Hot Basic Block (HBB)
  - Exclude floating point, multiply, divide and load instructions
  - Include at most one STORE, at most one BRANCH/JUMP and all other fixed point instructions
- Simple algorithm for generating custom instructions
  - HBBs usually include 10~40 instructions for MiBench
  - The custom instruction generator is executed on the base processor (in online training mode)
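The instruction rules listed for custom instructions can be sketched as a small validity check. This is only an illustration: the opcode sets are assumptions made for the example (MIPS-style mnemonics), not the tables used by the authors' generator, and `is_valid_ci` is a hypothetical name.

```python
# Hypothetical opcode classes; the slides do not list the exact mnemonics.
EXCLUDED = {
    "mult", "multu", "div", "divu",               # multiply / divide
    "lw", "lh", "lhu", "lb", "lbu",               # loads
    "mov.d", "add.d", "mul.d", "mfc1", "mtc1",    # floating point (assumed)
}
STORES = {"sw", "sh", "sb"}
BRANCHES = {"beq", "bne", "j", "jal", "jr"}


def is_valid_ci(opcodes):
    """Return True if a candidate CI obeys the stated instruction rules."""
    # Rule 1: no floating point, multiply, divide or load instructions.
    if any(op in EXCLUDED for op in opcodes):
        return False
    # Rule 2: at most one store and at most one branch/jump.
    stores = sum(op in STORES for op in opcodes)
    branches = sum(op in BRANCHES for op in opcodes)
    return stores <= 1 and branches <= 1
```

The restriction to a single HBB is not checked here; the sketch assumes the opcode list already comes from one basic block.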
Generating Custom Instructions
  4052c0 addiu $29,$29,-32
  4052c8 mov.d $f0,$f12
  4052d0 sw $18,24($29)
  4052d8 addu $18,$0,$6
  4052e0 sw $31,28($29)
  4052e8 sw $16,16($29)
  4052f0 mfc1 $16,$f0
  4052f8 mfc1 $17,$f1
  405300 srl $6,$17,0x14
  405308 andi $6,$6,2047
  405310 sltiu $2,$6,2047
  405318 addu $6,$6,$18
  405320 sltiu $2,$6,2047
  405328 lui $2,32783
  405330 and $17,$17,$2
  405338 andi $2,$6,2047
  405340 sll $2,$2,0x14
  405348 or $17,$17,$2
  405350 mtc1 $16,$f0
  405358 mtc1 $17,$f1
  405360 lw $31,28($29)
  405370 lw $16,16($29)
  405378 addiu $29,$29,32
  405380 jr $31
- Finding the biggest sequence of instructions in the HBB that can be executed on the ACC
- Moving the instructions and appending supportable instructions to the head of the detected instruction sequence, after checking flow-dependency and anti-dependency
- Moving the instructions and appending supportable instructions to the tail of the detected instruction sequence, after checking flow-dependency and anti-dependency
- Rewriting the object code if instructions have been moved
- Moving instructions must not modify the logic of the application
- Custom instruction generation is done without considering any other constraints
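The first step, finding the biggest run of supportable instructions in an HBB, might look like the sketch below. It is a simplified illustration under stated assumptions: the opcode sets are hypothetical, the function name `longest_ci_window` is invented for the example, and the later steps (moving instructions to the head or tail after dependency checks) are not implemented. The opcode list is taken from the HBB shown on the slide.

```python
# Assumed opcode classes for the example; not the authors' actual tables.
UNSUPPORTED = {"mov.d", "mfc1", "mtc1", "lw", "mult", "div"}
STORES = {"sw", "sh", "sb"}
BRANCHES = {"beq", "bne", "j", "jr"}


def longest_ci_window(opcodes):
    """Return (start, end) of the longest contiguous run, end exclusive."""
    best = (0, 0)
    for start in range(len(opcodes)):
        stores = branches = 0
        end = start
        while end < len(opcodes):
            op = opcodes[end]
            if op in UNSUPPORTED:
                break
            stores += op in STORES
            branches += op in BRANCHES
            # At most one store and one branch/jump per custom instruction.
            if stores > 1 or branches > 1:
                break
            end += 1
        if end - start > best[1] - best[0]:
            best = (start, end)
    return best


# Opcodes of the hot basic block from the slide, in program order.
HBB = ["addiu", "mov.d", "sw", "addu", "sw", "sw", "mfc1", "mfc1",
       "srl", "andi", "sltiu", "addu", "sltiu", "lui", "and", "andi",
       "sll", "or", "mtc1", "mtc1", "lw", "lw", "addiu", "jr"]

start, end = longest_ci_window(HBB)
print(HBB[start:end])  # the run from srl to or
```

A quadratic scan is acceptable here because HBBs are reported to hold only 10~40 instructions.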
Reconfigurable Functional Unit (RFU)
- RFU is a matrix of Functional Units (FUs)
- RFU has configuration memory
- FUs support only logical operations, add/subtract, shifts and compare
- RFU updates the PC after executing each CI
- RFU has variable delay which depends on the depth of the DFG of Custom Instructions
RFU Architecture: A Quantitative Approach
- 22 programs of MiBench were chosen
- SimpleScalar toolset was utilized for simulation
- RFU is a matrix of FUs; parameters studied:
  - No. of inputs
  - No. of outputs
  - No. of FUs
  - Width
  - Depth
  - Connections
  - Location of inputs & outputs
- Coverage (mapping) rate: percentage of generated CIs that can be mapped on the RFU considering constraints
- Considering frequency and weight in measurement
  - CI execution frequency
  - Weight (equal to the number of executed instructions)
  - Average = Σ (Freq × Weight) over all CIs
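One plausible reading of the frequency- and weight-based coverage measure is sketched below. The slide only gives the sum Σ (Freq × Weight); normalising the mapped share by the total, as done here, is an assumption, and `weighted_coverage` is a hypothetical name.

```python
def weighted_coverage(cis):
    """Coverage in percent, weighting each CI by frequency * weight.

    cis: iterable of (frequency, weight, mappable) tuples, where weight
    is the number of instructions the CI stands for (assumed meaning).
    """
    cis = list(cis)
    total = sum(freq * weight for freq, weight, _ in cis)
    mapped = sum(freq * weight for freq, weight, ok in cis if ok)
    return 100.0 * mapped / total if total else 0.0


# Made-up numbers, purely for illustration.
example = [(100, 5, True), (10, 10, False), (50, 2, True)]
print(round(weighted_coverage(example), 2))
```

Weighting this way makes a frequently executed, long CI count for more than a rarely executed, short one.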
Tool Chain
[Figure: tool-chain flow. Blocks shown: 22 applications of MiBench; SimpleScalar (PISA configuration); Profiler and Base Processor; Detecting start addresses of HBBs; Reading HBBs from object code; Custom Instruction Generator; Generating DFG for HBBs; Optimization (constant propagation); Updating DFG; Mapping CIs on the RFU. Results are used for designing the RFU.]
RFU Inputs (no constraint)
[Chart: "Input No. Analysis - Optimized Version"; x-axis: Input No. (1 to 19); y-axis: Coverage (0 to 120); labelled values: 96.37, 89.37, 98.48; marker: 8]
RFU Outputs (no constraint)
[Chart: "Output No. Analysis - Optimized Version"; x-axis: Output No. (1 to 14); y-axis: Coverage (0 to 120); labelled value: 96.58; marker: 6]
RFU Architecture
- Distributing inputs in different rows: Row 1 = 7, Row 2 = 2, Row 3 = 2, Row 4 = 2, Row 5 = 1
- Connections with variable length: row1 to row3 = 1, row1 to row4 = 1, row1 to row5 = 1, row2 to row4 = 1
- Synthesis results using Hitachi 0.18 μm
  - Area: 1.1534 mm²
  - Delay: 9.66 ns
Generating Custom Instruction for the Target RFU
- In our primary CI generator we did not consider any constraints for the generated CIs and tried to generate CIs as large as possible.
- Therefore, once the architecture was fixed, some of the generated CIs could not be mapped on the proposed RFU due to its constraints.
Customizing CI generator for the Target RFU – First Approach (CIGen)
- Some primary constraints of the RFU (number of inputs, number of outputs and number of nodes) were added to our CI generator tool to generate CIs that are mappable.
- In this approach the CI generator is unaware of the results of the mapping process.
- Some CIs may ultimately not be mapped to the RFU due to the routing and connection constraints.
[Figure: flow. The CI Generation Tool takes the RFU architectural constraints and produces CIs. If mapping is successful (YES), they become final configurations; if not (NO), they are rejected CIs, which have to be run on the base processor.]
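A check of the three primary constraints named for CIGen (number of inputs, outputs and nodes) could be sketched as follows. The limit values, the `Limits` type and the function names are assumptions made for illustration; the slides do not state the exact limits used. Outputs are counted conservatively as every register written, since liveness information is not available here. The sample CI is the eight-instruction example from the mapping slides.

```python
from typing import NamedTuple


class Limits(NamedTuple):
    max_inputs: int
    max_outputs: int
    max_nodes: int


def ci_io(instrs):
    """Return (inputs, outputs) register sets of a CI.

    instrs: list of (dest, [source registers]); dest may be None.
    An input is a register read before any instruction in the CI writes it.
    """
    written, inputs = set(), set()
    for dest, srcs in instrs:
        for reg in srcs:
            if reg not in written:
                inputs.add(reg)
        if dest is not None:
            written.add(dest)
    return inputs, written


def meets_primary_constraints(instrs, limits):
    inputs, outputs = ci_io(instrs)
    return (len(inputs) <= limits.max_inputs
            and len(outputs) <= limits.max_outputs
            and len(instrs) <= limits.max_nodes)


# Eight-instruction CI from the slides (immediates omitted from sources).
CI = [("R2", ["R17"]), ("R2", ["R2", "R17"]), ("R2", ["R2"]),
      ("R2", ["R2", "R17"]), ("R2", ["R2"]), ("R2", ["R2", "R20"]),
      ("R3", ["R18"]), ("R3", ["R3", "R2"])]

print(ci_io(CI))
```

Passing this check does not guarantee that a CI is mappable, which matches the point that routing and connection constraints are only discovered during mapping.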
Customizing CI generator for the Target RFU – Second Approach
Integrated Framework
- Performs an integrated temporal partitioning and mapping process
- Takes rejected CIs as input
- Partitions them to appropriate mappable CIs
- Advantages
  - All generated CIs are mappable
  - Using a mapping-aware temporal partitioning process
[Figure: flow of the Integrated Framework. Blocks shown: CIs generated by CI generation tool; Mapping on RFU; Mapping is successful? (YES/NO); Temporal Partitioning; Temporal Partitions; Incremental Temporal Partitioning; Incremental Temporal Partitions; Final Configurations]
Integrated Framework - Incremental Temporal Partitioning Algorithm
- Incremental Temporal Partitioning: the node with the highest ASAP level is selected and moved to the subsequent partition.
- Nodes selection and moving order: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7.
[Figure: Data Flow Graph of the input CI (nodes 0 to 15) divided into a 1st Partition and a 2nd Partition, with the RFU Map of each partition]
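The rule of moving the node with the highest ASAP level to the subsequent partition can be sketched as below. This is a minimal illustration, not the authors' implementation: the `fits` callback stands in for a real mapping attempt on the RFU, ties between nodes of equal ASAP level are broken by node id (an assumption), and the small graph is invented for the example.

```python
def asap_levels(nodes, edges):
    """ASAP level: 0 for source nodes, else 1 + max level of predecessors."""
    preds = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
    levels = {}

    def level(n):
        if n not in levels:
            levels[n] = 1 + max((level(p) for p in preds[n]), default=-1)
        return levels[n]

    for n in nodes:
        level(n)
    return levels


def incremental_partition(nodes, edges, fits):
    """Split a DFG into partitions that each satisfy fits(partition)."""
    levels = asap_levels(nodes, edges)
    partitions, remaining = [], list(nodes)
    while remaining:
        current, moved = list(remaining), []
        while current and not fits(current):
            # Highest ASAP level first, so successors always leave
            # a partition before their predecessors do.
            victim = max(current, key=lambda n: (levels[n], n))
            current.remove(victim)
            moved.append(victim)
        if not current:
            raise ValueError("a single node does not fit")
        partitions.append(sorted(current))
        remaining = moved
    return partitions


# Invented 6-node DFG; the stand-in mapper accepts at most 4 nodes.
NODES = [0, 1, 2, 3, 4, 5]
EDGES = [(0, 2), (1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]
print(incremental_partition(NODES, EDGES, lambda part: len(part) <= 4))
```

Because a successor always has a higher ASAP level than its predecessors, every partition produced this way only depends on earlier partitions.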
Mapping Custom Instructions
- Mapping is the same as the well-known placement problem: determining the appropriate positions for DFG nodes on the RFU.
- Assigning CI instructions to FUs is done based on the priority of the nodes.
An Example: Mapping of a CI on the RFU
A Custom Instruction:
  1: SUBU R3, R0, R3
  2: ADDU R10, R0, R0
  3: SRA R8, R10, 0x3
  4: SLT R2, R3, R8
  5: BNE R0, 400488, R2
[Figure: Data Flow Graph of the CI (SUBU, ADDU, SRA, SLT, BNE with operands R0, R3, R10, R8, R2, 0x3 and 400488) and its RFU Map]
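The data flow graph of the five-instruction example can be derived from its register usage, as in the sketch below. The function names are hypothetical, and only flow (read-after-write) dependencies inside the CI are tracked; immediates and the branch target are left out of the source lists.

```python
def build_dfg(instrs):
    """Return flow-dependency edges (producer, consumer), 1-based."""
    last_writer = {}
    edges = []
    for idx, (dest, srcs) in enumerate(instrs, start=1):
        for reg in dict.fromkeys(srcs):      # drop duplicate sources
            if reg in last_writer:
                edges.append((last_writer[reg], idx))
        if dest is not None:
            last_writer[dest] = idx
    return edges


def node_depths(count, edges):
    """Depth of each node; a node with no producer has depth 1."""
    depth = {i: 1 for i in range(1, count + 1)}
    # Program order is a topological order, so sorting by consumer works.
    for producer, consumer in sorted(edges, key=lambda e: e[1]):
        depth[consumer] = max(depth[consumer], depth[producer] + 1)
    return depth


# (destination, register sources) for instructions 1 to 5 of the example.
CI = [("R3", ["R0", "R3"]),    # 1: SUBU R3, R0, R3
      ("R10", ["R0", "R0"]),   # 2: ADDU R10, R0, R0
      ("R8", ["R10"]),         # 3: SRA R8, R10, 0x3
      ("R2", ["R3", "R8"]),    # 4: SLT R2, R3, R8
      (None, ["R0", "R2"])]    # 5: BNE R0, 400488, R2

edges = build_dfg(CI)
depths = node_depths(len(CI), edges)
print(edges, max(depths.values()))
```

The deepest chain runs through instructions 2, 3, 4 and 5, so this CI needs four levels of FUs.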
Customizing Mapping Tool
- Spiral-shaped mapping is possible thanks to the horizontal connections in the third and fourth rows of the RFU
A Custom Instruction:
  1: SLL R2, R17, 0x1
  2: ADDU R2, R2, R17
  3: SLL R2, R2, 0x3
  4: ADDU R2, R2, R17
  5: SLL R2, R2, 0x4
  6: ADDU R2, R2, R20
  7: SLL R3, R18, 0x2
  8: ADDU R3, R3, R2
[Figure: Data Flow Graph of the CI with its critical path marked, and the spiral-shaped RFU Map]
CI lengths for MiBench applications
Percentage of rejected CIs for CIGen
[Bar chart: % of rejected CIs (0 to 70) per benchmark: bitcounts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame]
Initial and final number of partitions
[Bar chart: No. of partitions (0 to 90) per benchmark: bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha; series: Initial No. of Partitions, Final No. of Partitions]
Maximum critical path length for CIs
[Bar chart: maximum critical path length (0 to 8) per benchmark: bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha]
Performance Evaluation
Base processor configuration:
  Issue             1-way
  L1 I-cache        32K, 2 way, 1 cycle latency
  L1 D-cache        32K, 4 way, 1 cycle latency
  Unified L2        1M, 6 cycle latency
  Execution units   1 integer, 1 floating point
  RUU size          64
  Fetch queue size  64
- SimpleScalar was configured to behave as a MIPS32 4K processor. The base processor supports the MIPS instruction set.
- 22 applications of MiBench
Delay of RFU according to CI length
  CI Length   RFU Delay (ns)
  1           1.38
  2           2.28
  3           3.12
  4           4.89
  5           6.47
  6           7.57
  7           8.65
  8           9.66
Synopsys Tools + Hitachi 0.18 μm
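Since the RFU delay varies with CI length, a simulator needs some way to turn the table into execution time. The sketch below converts the tabulated delay into whole processor cycles. The clock period is a hypothetical parameter (the slides do not give one), `rfu_cycles` is an invented name, and rounding up to full cycles is an assumption about how the delay would be charged.

```python
import math

# RFU delay in ns per CI length, copied from the table on the slide.
RFU_DELAY_NS = {1: 1.38, 2: 2.28, 3: 3.12, 4: 4.89,
                5: 6.47, 6: 7.57, 7: 8.65, 8: 9.66}


def rfu_cycles(ci_length, clock_period_ns):
    """Cycles needed by a CI of the given length (rounded up)."""
    if ci_length not in RFU_DELAY_NS:
        raise ValueError(f"no delay entry for CI length {ci_length}")
    return math.ceil(RFU_DELAY_NS[ci_length] / clock_period_ns)


# Example with an assumed 2.5 ns clock period (400 MHz).
for length in (1, 4, 8):
    print(length, rfu_cycles(length, 2.5))
```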
Speedup
[Bar chart: speedup (1 to 2.4) per benchmark: bitcounts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec), sha; series: Speedup using integrated framework, Speedup using CI generation tool]
Conclusions
Proposing a reconfigurable functional unit for an Adaptive Dynamic Extensible Processor using a quantitative approach.
Developing an integrated framework for partitioning and mapping custom instructions for the proposed RFU.
Thank you for your attention.