Design Space Exploration with SimpleScalar
Vittorio Zaccaria – Alari @ ST 2001
The SimpleScalar Toolset
The SimpleScalar Toolset
Simulation Suite
SimpleScalar ISA
• clean and simple instruction set architecture:
  • MIPS/DLX plus more addressing modes, minus delay slots
• 64-bit inst encoding facilitates instruction set research
  • 16-bit space for hints, new insts, and annotations
  • four operand instruction format, up to 256 registers
SimpleScalar Architected State
Out of order simulator
Configurable set of FUs
Configurable Memory Hierarchy
• All caches and TLB configurations specified with same format:
  <nsets>:<bsize>:<assoc>:<repl>
• Block replacement policy
  • l - for LRU
  • f - for FIFO
  • r - for RANDOM
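As an illustration of the format above, the following is a minimal sketch of a parser for a `<nsets>:<bsize>:<assoc>:<repl>` string. The names `CacheConfig` and `parse_cache_config` are hypothetical helpers, not part of SimpleScalar, and the capacity formula assumes the block size is given in bytes.

```python
from dataclasses import dataclass

# Replacement-policy codes as listed on the slide.
POLICIES = {"l": "LRU", "f": "FIFO", "r": "RANDOM"}


@dataclass(frozen=True)
class CacheConfig:
    """One cache or TLB level (hypothetical helper type)."""
    nsets: int   # number of sets
    bsize: int   # block size, assumed to be in bytes
    assoc: int   # associativity (ways per set)
    repl: str    # replacement policy name

    @property
    def capacity_bytes(self) -> int:
        # Total data capacity = sets * block size * ways.
        return self.nsets * self.bsize * self.assoc


def parse_cache_config(spec: str) -> CacheConfig:
    """Parse '<nsets>:<bsize>:<assoc>:<repl>' into a CacheConfig."""
    fields = spec.split(":")
    if len(fields) != 4:
        raise ValueError(f"expected 4 colon-separated fields, got {spec!r}")
    nsets, bsize, assoc = (int(f) for f in fields[:3])
    code = fields[3].lower()
    if code not in POLICIES:
        raise ValueError(f"unknown replacement policy {fields[3]!r}")
    return CacheConfig(nsets, bsize, assoc, POLICIES[code])


dl1 = parse_cache_config("128:32:4:l")
print(dl1.repl, dl1.capacity_bytes)
```

Under these assumptions, a 128-set, 32-byte-block, 4-way cache holds 128 * 32 * 4 = 16384 bytes.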
Configurable Memory Hierarchy
Design Space Exploration
• Metric definition
  • Energy*Delay
  • Area*Delay
• Design space definition
  • L1 and L2 caches, n° ALUs ...
• Embedded Application Definition
• Metric minimization
  • Exhaustive search
  • Greedy search
  • Gradient search
  • Simulated Annealing and so on
Design Space Exploration: A case study.
• Metric Defined: Price over Performance (PoP) = area*CPI
• Design space:
  • Sets, block, associativity and replacement policy for each cache;
  • number of integer ALUs;
  • number of integer multipliers;
  • number of floating-point ALUs;
  • number of floating-point multipliers.
Design space exploration performed by F. Cassoli and A. Ferrante @ ALARI
Design Space Definition
• Ranges for each parameter
  • DL1:128:{32, 64}:4:L
  • IL1:{256, 512}:32:1:L
  • UL2:{1024, 2048}:{64, 128}:4:{L, F}
  • IALU:{2, 4}
  • IMULT:{1, 2, 4}
  • FPALU:{1, 4}
  • FPMULT:{1, 2}
• 768 different cases
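The count of 768 cases follows from multiplying the number of options per parameter: 2 * 2 * 8 * 2 * 3 * 2 * 2. A minimal sketch of the enumeration is given below; the dictionary layout and the name `enumerate_configs` are illustrative choices, not taken from the original study.

```python
from itertools import product

# Parameter ranges transcribed from the slide; cache entries use the
# <nsets>:<bsize>:<assoc>:<repl> format.
DESIGN_SPACE = {
    "DL1": ["128:32:4:L", "128:64:4:L"],
    "IL1": ["256:32:1:L", "512:32:1:L"],
    "UL2": [
        f"{nsets}:{bsize}:4:{repl}"
        for nsets in (1024, 2048)
        for bsize in (64, 128)
        for repl in ("L", "F")
    ],
    "IALU": [2, 4],
    "IMULT": [1, 2, 4],
    "FPALU": [1, 4],
    "FPMULT": [1, 2],
}


def enumerate_configs(space):
    """Yield every point of the design space as a {parameter: value} dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(enumerate_configs(DESIGN_SPACE))
print(len(configs))  # 768
```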
Embedded Application
• EPIC decoder (Efficient Pyramid Image deCoder)
  • Image data compression utility written in C.
  • Free Mediabench Source
  • Based on wavelet decomposition and a Huffman entropy (de)coder.
Cost Function
F(x) = A(x)*D(x)
• Area of x (sum of equivalent gates of each module). Models found in the literature.
• Delay of x (computed through simulation of EPIC on architecture x).
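The structure of the cost function can be sketched as a product of two pluggable models. The slides do not give the area model or the simulated delays, so `toy_area` and `toy_delay` below are placeholders with made-up coefficients, used only to show how F(x) = A(x)*D(x) would be evaluated for a configuration x.

```python
from typing import Callable, Mapping

Config = Mapping[str, int]


def cost(x: Config,
         area: Callable[[Config], float],
         delay: Callable[[Config], float]) -> float:
    """F(x) = A(x) * D(x), with the area and delay models passed in."""
    return area(x) * delay(x)


# Placeholder models (NOT from the study): in the original work, area is a
# sum of equivalent gates per module and delay comes from simulating EPIC.
def toy_area(x: Config) -> float:
    return 1000.0 * x["IALU"] + 5000.0 * x["IMULT"]


def toy_delay(x: Config) -> float:
    return 1.0 / x["IALU"] + 0.5


example = {"IALU": 2, "IMULT": 1}
print(cost(example, toy_area, toy_delay))  # (2000 + 5000) * (0.5 + 0.5) = 7000.0
```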
Result of the exhaustive search
[Figure: PoP (about 9.E+05 to 3.E+06) plotted against configuration index (2 to 702). Series: DL1-# of sets = 32; DL1-# of sets = 64]
Optimal Configuration
• The lowest value of the PoP is 998’732.31, obtained with:
  DL1: 128:32:4:L
  IL1: 256:32:1:L
  UL2: 1024:64:4:F
  IALU: 4
  IMULT: 2
  FPALU: 4
  FPMULT: 2
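The slides do not show how each configuration was passed to the simulator. As a sketch, and assuming the stock `sim-outorder` option names (`-cache:dl1`, `-cache:il1`, `-cache:dl2`, `-cache:il2`, `-res:ialu`, `-res:imult`, `-res:fpalu`, `-res:fpmult`) with a cache name prepended to each cache string, the optimal configuration could be turned into an argument list as follows. The cache names and the benchmark command are illustrative placeholders.

```python
# Optimal configuration as reported on the slide.
OPTIMAL = {
    "DL1": "128:32:4:L",
    "IL1": "256:32:1:L",
    "UL2": "1024:64:4:F",
    "IALU": 4,
    "IMULT": 2,
    "FPALU": 4,
    "FPMULT": 2,
}


def sim_outorder_args(cfg, program):
    """Build an argument list for sim-outorder (assumed option names)."""
    return [
        "sim-outorder",
        "-cache:dl1", f"dl1:{cfg['DL1'].lower()}",
        "-cache:il1", f"il1:{cfg['IL1'].lower()}",
        # Unified L2: define it as the data L2 and point the
        # instruction L2 at the same cache.
        "-cache:dl2", f"ul2:{cfg['UL2'].lower()}",
        "-cache:il2", "dl2",
        "-res:ialu", str(cfg["IALU"]),
        "-res:imult", str(cfg["IMULT"]),
        "-res:fpalu", str(cfg["FPALU"]),
        "-res:fpmult", str(cfg["FPMULT"]),
        *program,
    ]


# "epic-benchmark" is a hypothetical name for the compiled EPIC binary.
args = sim_outorder_args(OPTIMAL, ["epic-benchmark"])
print(" ".join(args))
```

The list form can be handed to `subprocess.run` unchanged, which avoids shell quoting issues when the benchmark takes its own arguments.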
Cost Function Properties
• The difference between the PoPs for a DL1 cache of 32 and of 64 sets is very small.
• The difference between the PoPs for an IL1 cache of 256 and of 512 sets is very small.
[Figure: PoP (0.E+00 to 3.E+06) plotted against configuration index (1 to 151). Series: UL2-# of sets = 1024 and UL2-# of sets = 2048, each with UL2-dim of block = 64 and UL2-dim of block = 128]
Cost Function Properties
• Increasing the number of sets of UL2 increases the PoP (on average).
• Increasing the block size of the UL2 cache always leads to an abrupt growth of the PoP.
  • The L2 cache grows so much that it becomes significantly larger than the rest of the system.
Cost Function Properties
[Figure: PoP (9.80E+05 to 1.18E+06) plotted against configuration index (1 to 41). Series: # IALU = 2; # IALU = 4]
Cost Function Properties
[Figure: PoP (9.95E+05 to 1.04E+06) plotted against configuration index (25 to 45). Series: # IMULT = 1; # IMULT = 2; # IMULT = 4, grouped by FPALU = 1 and FPALU = 4]
Cost Function Properties
[Figure: PoP (9.98E+05 to 1.01E+06) plotted against configuration index (37 to 40). Series: FPMULT = 1, UL2 type = L; FPMULT = 1, UL2 type = F; FPMULT = 2, UL2 type = L; FPMULT = 2, UL2 type = F]
Area – CPI scatter plot
[Figure: scatter plot of CPI (0 to 1.4) against Area (1200000 to 3200000)]
Conclusions
• Reduction of PoP when the number of integer ALUs is doubled. Great benefit with reduced area increase.
• Optimal configuration has IMULT = 2 (not 1 or 4, because EPIC does not expose much parallelism).
• However, FPALU = 4 leads to better results than FPALU = 1.
• L2 FIFO policy outperforms LRU.
• Same benefits when adding an FPMULT.
Conclusions
• A greedy algorithm has also been applied to minimize the cost function.
• Starting from different points
  • average number of simulations required = 49
  • minimum number of simulations required = 11
  • maximum number of simulations required = 83
• Full search optimum always reached
• Considering that an exhaustive search needs 768 simulations, the greedy search reduces simulation time by about 93.6% on average.
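The slides do not describe the greedy algorithm in detail. One plausible reading, sketched below as an assumption, is a coordinate-wise descent: starting from a given point, change one parameter at a time, keep any change that lowers the cost, and stop when a full pass brings no improvement. Evaluations are memoized so that the count of distinct "simulations" can be reported. The function name `greedy_search` and the toy cost are hypothetical.

```python
def greedy_search(space, start, cost):
    """Coordinate-wise greedy minimization over a discrete design space.

    space: {parameter: list of allowed values}
    start: {parameter: initial value}
    cost:  function mapping a configuration dict to a number
    Returns (best configuration, best cost, number of distinct evaluations).
    """
    evaluated = {}

    def evaluate(cfg):
        key = tuple(cfg[name] for name in space)
        if key not in evaluated:          # each distinct point costs one run
            evaluated[key] = cost(cfg)
        return evaluated[key]

    current = dict(start)
    best = evaluate(current)
    improved = True
    while improved:
        improved = False
        for name, values in space.items():
            for value in values:
                if value == current[name]:
                    continue
                candidate = {**current, name: value}
                c = evaluate(candidate)
                if c < best:              # accept any strict improvement
                    current, best, improved = candidate, c, True
    return current, best, len(evaluated)


# Toy example with a made-up cost whose minimum is at a=2, b=2.
toy_space = {"a": [1, 2, 4], "b": [1, 2]}
toy_cost = lambda cfg: (cfg["a"] - 2) ** 2 + (cfg["b"] - 2) ** 2
best_cfg, best_cost, n_evals = greedy_search(toy_space, {"a": 1, "b": 1}, toy_cost)
print(best_cfg, best_cost, n_evals)

# Savings reported on the slide: 49 simulations on average instead of 768.
saving = 1 - 49 / 768
print(f"{saving:.1%}")  # 93.6%
```

A descent of this kind is only guaranteed to find a local minimum in general; the slide reports that on this design space the full-search optimum was reached from every starting point tried.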