Application-Specific Customization of Soft Processor Microarchitecture
Peter Yiannacouras, J. Gregory Steffan, Jonathan Rose
University of Toronto
Edward S. Rogers Sr. Department of Electrical and Computer Engineering
Processors and FPGA Systems
We seek improvement through customization
Processors lie at the “heart” of FPGA systems
[Diagram: an FPGA system containing a soft processor, a memory interface, a UART, custom logic, and Ethernet]
■ Performs coordination and even computation
■ Better processors => less hardware to design
Motivating Application-Specific Customizations of Soft Processors
1. FPGA Configurability
   ■ Can consider unlimited processor variants
2. A soft processor might be used to run either:
   a) A single application
   b) A single class of applications
   c) Many applications, but can be reconfigured
3. Applications differ in architectural requirements
   ■ Can specialize architecture for each application
We want to evaluate effectiveness of specialization
Research Goals
To investigate:
1. The potential for “Application-tuning”
   ■ Tune processor microarchitecture to favour an application
   ■ Preserve general purpose functionality
2. “Instruction-set Subsetting”
   ■ Sacrifice general purpose functionality
   ■ Eliminate hardware not required by application
3. Combination of both methods
Measure efficiency gained through real implementations
SPREE System (Soft Processor Rapid Exploration Environment)
■ Input: Processor description (ISA and datapath)
   ■ Made of hand-coded components
■ SPREE System
   1. Verify ISA against datapath
   2. Datapath Instantiation
   3. Control Generation
      ■ Multi-cycle/variable-cycle FUs
      ■ Multiplexer select signals
      ■ Interlocking
      ■ Branch handling
■ Output: Synthesizable Verilog (RTL)
Back-End Infrastructure
[Diagram: the generated RTL feeds two tool flows]
■ Modelsim RTL Simulator, running benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
   1. Cycle Count
■ Quartus II 4.2 CAD Software, targeting a Stratix 1S40C5
   2. Resource Usage
   3. Clock Frequency
   4. Power
We can measure area/performance/energy accurately
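As a rough illustration of how these measurements combine, the sketch below turns a simulated cycle count and a synthesized clock frequency into wall clock time, then summarizes a benchmark suite with a geometric mean. The benchmark names and all numbers are hypothetical placeholders, not measured results, and the helper names are assumptions made for this example.

```python
import math

def wall_clock_time_us(cycle_count, clock_mhz):
    """Execution time in microseconds: cycles / (cycles per microsecond)."""
    return cycle_count / clock_mhz

def geomean(values):
    """Geometric mean, computed in log space."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical inputs: cycle counts from RTL simulation and a clock
# frequency from the CAD tool. These are illustrative values only.
cycles = {"bench_a": 120_000, "bench_b": 480_000}
clock_mhz = 80.0

times_us = {name: wall_clock_time_us(c, clock_mhz) for name, c in cycles.items()}
overall_us = geomean(times_us.values())
print(times_us, overall_us)
```

The geometric mean is used here so that no single long-running benchmark dominates the summary.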
Comparison to Altera’s Nios II
Has three variations:
■ Nios II/e – unpipelined, no HW multiplier
■ Nios II/s – 5-stage, with HW multiplier
■ Nios II/f – 6-stage, dynamic branch prediction
Architectural Parameters Used in SPREE
We focus on core microarchitecture
■ Multiplication Support
   ■ Hardware FU or software routine
■ Shifter implementation
   ■ Flipflops, multiplier, or LUTs
■ Pipelining
   ■ Depth (2-7 stages)
   ■ Organization
   ■ Forwarding
SPREE vs Nios II
[Figure: scatter plot of Geomean Wall Clock Time (us) against Area (Equivalent LEs) for SPREE Processors, Altera Nios II/e, Altera Nios II/s, and Altera Nios II/f. Annotations: "smaller", "faster". One SPREE design is called out: 3-stage pipe, HW multiply, multiply-based shifter.]
Exploration of Soft Processor Architectural Customizations
1. Architectural-tuning
2. Instruction-set subsetting
3. Combination (Arch-tuning + Subsetting)
1. Architectural Tuning Experiment
Vary the same parameters:
■ Multiplication Support
■ Shifter implementation
■ Pipelining
Determine:
1. Best overall (general purpose) processor
2. Best per application (application-tuned)
Metric: Performance per Area (MIPS/LE)
■ Basically inverse of Area-Delay product
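The selection step above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the processor and benchmark names and the MIPS/LE values are hypothetical, and picking the "best overall" processor by geometric mean across benchmarks is an assumption made for this example.

```python
import math

# Hypothetical MIPS/LE values: processor -> benchmark -> performance per area.
mips_per_le = {
    "proc_a": {"bench_x": 0.040, "bench_y": 0.050},
    "proc_b": {"bench_x": 0.060, "bench_y": 0.030},
}

def geomean(values):
    """Geometric mean, computed in log space."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

def best_overall(table):
    """General purpose choice: highest geomean MIPS/LE over all benchmarks
    (the averaging rule is an assumption of this sketch)."""
    return max(table, key=lambda proc: geomean(table[proc].values()))

def best_per_application(table):
    """Application-tuned choice: highest MIPS/LE for each benchmark."""
    benchmarks = next(iter(table.values())).keys()
    return {b: max(table, key=lambda proc: table[proc][b]) for b in benchmarks}

general = best_overall(mips_per_le)
tuned = best_per_application(mips_per_le)

# Relative gain of the tuned processor over the general purpose one.
gain = {b: mips_per_le[tuned[b]][b] / mips_per_le[general][b] - 1.0 for b in tuned}
print(general, tuned, gain)
```

A benchmark whose tuned processor is the general purpose processor shows a gain of zero, so the average gain reflects only the applications that actually prefer a different microarchitecture.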
Performance per Area of All Processors
[Figure: Performance per Unit Area (MIPS/LE, axis from 0 to 0.08) for each benchmark (bubble_sort, crc, des, fft, fir, quant, iquant, turbo, vlc, bitcnts, CRC32, qsort, sha, stringsearch, FFT, dijkstra, patricia, gol, dct, dhry) and OVERALL. The plot is repeated with the General Purpose series and then the Application-tuned series added. Legend: one series per SPREE processor, for example serialshift_norise, mulshift_minrise, barrelshift_highrise, pipe3_mulshift, pipe4_barrelshift_2, pipe5_mulshift_stall_load, pipe7_barrelshift, and .nomul variants such as pipe3_mulshift.nomul. Annotations: 14.1%, 32%.]
2. Instruction-set Subsetting
Reduce processor by reducing the ISA
■ Can create application-specific processor
Eliminate unused parts of the ISA
SPREE automatically removes:
■ Unused connections
■ Unused components
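The idea of removing unused components and connections can be sketched as below. This is a simplified illustration under stated assumptions, not SPREE's actual algorithm: the instruction names, component names, and the `subset_datapath` helper are all hypothetical.

```python
# Hypothetical datapath description: which components each instruction needs.
instruction_needs = {
    "add": {"regfile", "alu"},
    "sll": {"regfile", "shifter"},
    "mul": {"regfile", "multiplier"},
    "lw": {"regfile", "alu", "data_mem"},
}

# Hypothetical connections between components, as (source, destination) pairs.
connections = {
    ("regfile", "alu"),
    ("regfile", "shifter"),
    ("regfile", "multiplier"),
    ("alu", "data_mem"),
}

def subset_datapath(instruction_needs, connections, used_instructions):
    """Keep only components needed by the used instructions, and only
    connections whose endpoints both survive. Simplification: a real tool
    would also drop connections between kept components that no used
    instruction exercises."""
    kept_components = set()
    for instr in used_instructions:
        kept_components |= instruction_needs[instr]
    kept_connections = {
        (src, dst)
        for (src, dst) in connections
        if src in kept_components and dst in kept_components
    }
    return kept_components, kept_connections

# An application that never shifts or multiplies.
components, wires = subset_datapath(instruction_needs, connections, {"add", "lw"})
print(sorted(components), sorted(wires))
```

In this toy case the shifter and multiplier disappear along with their wiring, which is the kind of reduction that subsetting aims for when an application leaves parts of the ISA unused.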
Instruction-set Usage of Benchmarks
Applications do not use complete ISA
[Figure: Percent of ISA Used (0% to 100%) for each benchmark (bubble_sort, crc, des, fft, fir, quant, iquant, turbo, vlc, bitcnts, CRC32, qsort, sha, stringsearch, FFT, dijkstra, patricia, gol, dct, dhry) and AVERAGE.]
Strong potential for hardware reduction
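A figure such as "percent of ISA used" could be computed along the following lines. This is only a sketch: it assumes usage is counted as the number of distinct opcodes appearing in a program, which the slides do not specify, and the opcode lists are hypothetical.

```python
def percent_isa_used(isa_opcodes, program_opcodes):
    """Share of the ISA's distinct opcodes that appear in the program."""
    isa = set(isa_opcodes)
    used = set(program_opcodes) & isa
    return 100.0 * len(used) / len(isa)

# Hypothetical 8-instruction ISA and a program that touches 4 of them.
isa = ["add", "sub", "sll", "mul", "lw", "sw", "beq", "j"]
program = ["lw", "add", "add", "sw", "beq"]

usage = percent_isa_used(isa, program)
print(usage)
```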
Area Reduction from Subsetting
[Figure: Fraction of Area (0.00 to 1.00) for each benchmark (bubble_sort, crc, des, fft, fir, quant, iquant, turbo, vlc, bitcnts, CRC32, qsort, sha, stringsearch, FFT_MI, dijkstra, patricia, gol, dct, dhry) and AVERAGE. Annotation: 23%.]
Area reduced by 60% in some, 23% on average
Similar reductions for energy, small impact on performance
3. Combining Application Tuning and Instruction-set Subsetting
Subsetting is effective on its own
■ Can apply subsetting on top of tuning
Compare different customization methods:
1. Tuning
2. Subsetting
3. Tuning + Subsetting
Combining Application Tuning and Instruction-set Subsetting
[Figure: bar chart of Performance Per Area (MIPS/1000 LEs, axis from 0 to 60) for four configurations: General purpose, Application-tuned, General purpose + subsetted, Application-tuned + subsetted. Annotations: 14%, 16%, 25%.]
Tuning reduces the waste that subsetting eliminates
Summary of Presented Architectural Conclusions
Application tuning
■ 14% average efficiency gain
■ Will increase with more architectural axes
Instruction-set Subsetting
■ Up to 60% area & energy savings
■ 16% average efficiency gain
Combined Tuning & Subsetting
■ 25% average efficiency gain
Future Work
Consider other promising architectural axes
■ Branch prediction, aggressive forwarding
■ ISA changes
■ Datapaths (e.g. VLIW)
■ Caches and memory hierarchy
Compiler assistance
■ Can improve tuning & subsetting