Behavioral Application-Dependent Superscalar Core Modeling
Ricardo Andrés Velásquez
Advisor: Pierre Michaud
Co-advisor: André Seznec
Introduction – Simulation
19/4/13Behavioral Application-dependent superscalar core modeling - 2
Some engineering fields allow us to build prototypes identical to the target design.
Computer engineering in contrast makes extensive use of computer simulation to test the boundaries of a design.
Jack Kilby's original integrated circuit (1958): 1 transistor.
Intel Core i7 (2012): 731E6 transistors.
Introduction – Microarchitecture simulation
Slow simulators
Simulation complexity increases faster than computer performance.
Wider design space exploration
Complexity makes it impossible to rely on intuition.
Designers rely on simulators to compare designs.
Multi/Many-core Processors
Complexity “doubles” with every generation.
Research focus on uncore (shared cache, interconnection, main memory, etc.)
Introduction – Microarchitecture simulation
Various models targeting different objectives.
[Figure: accuracy (low to high) versus simulation speed, placing RTL models, cycle-accurate models (detailed simulation), core models, 1-IPC models, statistical models and empirical models, for single-core and multi/many-core simulation.]
Contributions I
1. BADCO modeling technique for approximate simulation of modern superscalar cores.
2. Workload stratification methodology for selecting small and representative multiprogram workloads.
Simulation time – Detailed simulator
[Figure: normalized simulation time of the detailed simulator for each benchmark, split into core and uncore.]
Even worse for multicore architectures!!!
Core models
[Figure: two block diagrams of a simulated core. In each, a benchmark drives a functional model / oracle and a temporal model of the pipeline (Fetch, Alloc., Decode, Exec., Commit) with ITLB, IL1, DTLB and DL1. The first diagram attaches the core to the uncore (L2, LLC, MM, interconnection, etc.); the second shows the L2 as a separate block next to the uncore.]
Core models
[Figure: a multicore built from detailed cores 0 to N-1, each with its own benchmark, functional model, pipeline (Fetch, Alloc., Decode, Exec., Commit), ITLB, IL1, DTLB and DL1, sharing the LLC, interconnection and main memory.]
What if our design target is just the uncore?
[Figure: cores 0 to N-1 are each replaced by a core model run by a model simulator, with one L2 per core.]
Core models
Approximate model of a superscalar core that can be connected to a detailed uncore model.
Structural core models
Emulate internal behavior.
Model first-order parameters (ROB length, width).
Interval Simulation, In-N-Out, etc.
Behavioral core models
Emulate external behavior.
Derived from detailed simulation.
PDCM, ASPEN, etc.
Behavioral Core Models
[Figure: timelines of uncore requests and responses, in cycles, for two uncore configurations (uncore A and uncore B).]
Two real traces of uncore requests for identical instructions.
Request timing changes in non-obvious ways.
Current practices fail to model the timing changes.
Behavioral core models try to reproduce the external behavior of the core.
Pairwise Dependent Cache Miss model (PDCM)
K. Lee, S. Evans, and S. Cho, ISPASS 2009.
Trace of retired uops with uncore requests (ideal L2).
Three kinds of requests: IL1 misses, DL1 load misses and DL1 store misses.
Emulate ROB to limit number of parallel requests.
Consider data dependencies between trace items.
SimpleScalar + Perfect branch prediction + no HW-prefetching.
PDCM – Simulation flow
[Figure: PDCM simulation flow. A cycle-accurate simulation with zero miss penalty of each benchmark and core configuration produces the PDCM trace (performed once for every benchmark and core configuration pair: SLOW). The trace simulator, coupled to the uncore simulator, then replays the trace for each uncore configuration to obtain performance (performed once for every uncore configuration: FAST).]
PDCM – Model building
[Figure: building PDCM trace items from retired uops; request uops and non-request uops are drawn differently.]
Legend: RT = retirement time, S = number of uops, W = number of cycles.
Retired uops (RT): 1 (16), 2 (17), 3 (17), 4 (19), 5 (20), 6 (20), 7 (22), 8 (23), 9 (25).
Trace items:
uops 1,2,3: S=3, W=17
uops 4,5,6: S=3, W=3
uops 7,8: S=2, W=3
uop 9: S=1, W=2
Data dependencies: register + memory.
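The S and W values in the example above can be reproduced with a few lines of Python. This is a minimal sketch, not the PDCM implementation: the function name `build_trace_items` is hypothetical, the grouping of uops into items is taken from the slide rather than derived, and W is assumed to be the retirement time of the item's last uop minus that of the previous item's last uop.

```python
def build_trace_items(retire_times, item_sizes):
    """Compute (S, W) per trace item from retirement times.

    retire_times: RT of each retired uop, in program order.
    item_sizes:   number of uops in each trace item (taken from the slide).
    """
    items = []
    prev_rt = 0   # retirement time of the last uop of the previous item
    end = 0       # index one past the last uop of the current item
    for size in item_sizes:
        end += size
        last_rt = retire_times[end - 1]
        items.append({"S": size, "W": last_rt - prev_rt})
        prev_rt = last_rt
    return items


rt = [16, 17, 17, 19, 20, 20, 22, 23, 25]   # RT of uops 1..9 on the slide
items = build_trace_items(rt, [3, 3, 2, 1])
for item in items:
    print(item)
```

The four items come out with W = 17, 3, 3 and 2, matching the figure.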
Tuning PDCM to Zesto
[Figure: CPI error (%) of PDCM on Zesto as request types are added on top of PDCM: TLB misses, write-backs, wrong-path requests, prefetches and delayed hits. Average errors shown: 7.8, 7.2, 6.1, 5.5, 4.6 and 4.1 %. The extended model is labeled PDCM++.]
Average CPI error: 4.5 % with SimpleScalar vs. 7.8 % with Zesto.
Considering additional requests increases accuracy.
Zesto is a highly detailed cycle-level simulator (Loh et al., ISPASS'09).
PDCM limitations
Different sources of dependencies:
Data dependencies (register & memory).
Resource dependencies (queues: LDQ, STQ, etc.).
Resource dependencies impact performance.
Long-latency accesses.
Contention for resources.
More requests in the wrong path.
Tracking all sources of dependencies is complex.
Behavioral application-dependent superscalar core model – BADCO
New core model inspired by PDCM.
Two cycle-accurate traces:
Null latency T0, same as PDCM.
Long latency TL, to infer dependencies.
Emulate ROB and level-1 MSHRs to limit the number of parallel requests.
Differentiated processing for instruction requests and store requests.
BADCO Simulation Flow
[Figure: BADCO simulation flow. For each benchmark and core configuration, a zero-penalty simulation and a long-penalty simulation produce the traces T0 and TL, and model building turns them into the model graph (performed once for every benchmark and core configuration pair: SLOW). The BADCO machine then runs the model graph against the uncore simulator for each uncore configuration (performed once for every uncore configuration: FAST).]
Trace Generation
Two traces (Zesto) of retired μops.
T0
Level1 cache misses – zero penalty.
μops annotated with retirement time.
Capture fixed cost (W) of μops.
TL
Level1 cache misses – long penalty (1000 cycles).
μops annotated with: issue time (IT), completion time (CT) and uncore requests.
Infer and expose dependencies - capture requests.
[Figure: example of the two traces. In TL, a uop issues at IT=9 and completes at CT=2009; a uop with IT=2010, CT=2013 is marked dependent and a uop with IT=14, CT=19 is marked independent. In T0, retirement times RT=16 and RT=20 give a weight W=4.]
Model Building
[Figure: step-by-step construction of the BADCO model graph from the T0 and TL traces of seven uops; request uops and nodes are drawn differently from non-request ones.]
Legend: RT = retirement time, IT = issue time, CT = completion time, W = weight (cycles), S = size (μops), D = dependency.
uop   T0 RT   TL IT   TL CT
1     16      9       2009
2     17      2010    2013
3     17      14      19
4     19      2011    3012
5     20      3014    3016
6     20      2012    2019
7     23      3013    3021
Final model graph:
N1: W=16, S=2, D=0, uops 1,3
N2: W=1, S=1, D=N1, uop 2
N3: W=2, S=2, D=N1, uops 4,6
N4: W=4, S=2, D=N3, uops 5,7
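The model-building example can be replayed with a short Python sketch. It is a simplified illustration and not the actual BADCO algorithm, and it rests on several assumptions that are not stated in the deck: uops 1 and 4 are taken to be the request uops, a uop is taken to depend on the latest request that completed (in TL) before the uop issued, a non-request uop joins the most recent node with the same dependency, and a node's weight is the sum of its uops' retirement-time deltas in T0. The names `Node` and `build_nodes` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    dep: str                      # request node this node waits for ("0" = none)
    weight: int = 0               # W: cycles, from T0 retirement-time deltas
    uops: list = field(default_factory=list)  # member uops; S = len(uops)


def build_nodes(rt, it_ct, request_uops):
    """Group uops into nodes using the simplified rule described above."""
    nodes, completed = [], []     # completed: (TL completion time, node name)
    prev_rt = 0
    for i, (rt_i, (issue, complete)) in enumerate(zip(rt, it_ct), start=1):
        # Dependency: latest request that completed before this uop issued in TL.
        earlier = [(ct, name) for ct, name in completed if ct < issue]
        dep = max(earlier)[1] if earlier else "0"
        node = None
        if i not in request_uops:
            # Non-request uops join the most recent node with the same dependency.
            node = next((n for n in reversed(nodes) if n.dep == dep), None)
        if node is None:
            node = Node(f"N{len(nodes) + 1}", dep)
            nodes.append(node)
        if i in request_uops:
            completed.append((complete, node.name))
        node.weight += rt_i - prev_rt   # cost of this uop in the zero-penalty trace
        node.uops.append(i)
        prev_rt = rt_i
    return nodes


rt = [16, 17, 17, 19, 20, 20, 23]                        # T0 retirement times
it_ct = [(9, 2009), (2010, 2013), (14, 19), (2011, 3012),
         (3014, 3016), (2012, 2019), (3013, 3021)]       # TL issue/completion times
graph = build_nodes(rt, it_ct, request_uops={1, 4})      # assumed request uops
for n in graph:
    print(n.name, f"W={n.weight}", f"S={len(n.uops)}", f"D={n.dep}", n.uops)
```

Under these assumptions the sketch yields the four nodes of the slide: N1 (W=16, S=2, D=0), N2 (W=1, S=1, D=N1), N3 (W=2, S=2, D=N1) and N4 (W=4, S=2, D=N3).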
Model Simulation – BADCO Machine
[Figure: animated example of the BADCO machine, with Fetch, Exe. and Store stages, processing a model graph against the uncore. Each step shows the current cycle, the ROB occupancy and the requests sent to the uncore (ITLB, IL1, DTLB, DL1_LD, DL1_ST, DL1_PF, DL1_WB, DL1_HoM); a STALL is marked in the example.]
Nodes in the example (W = weight, S = size, D = dependency):
N1: W=17, S=4, D=0
N2: W=9, S=8, D=N1
N3: W=25, S=26, D=0
N4: W=51, S=48, D=N2
N5: W=50, S=56, D=N1
N6: W=73, S=80, D=N4
N7: W=4, S=5, D=N6
N8: W=10, S=13, D=N6
N9: W=21, S=19, D=N8
N10: W=50, S=56
Evaluation methodology
• Compare single-core accuracy of PDCM and BADCO with respect to Zesto:
• Quantitative Accuracy (3 core config.)
• Relative Accuracy (6 uncore config.)
• Compare simulation speed of PDCM and BADCO for single thread.
• Measure multi-core accuracy and simulation speed of BADCO with respect to Zesto.
Experimental Setup
Uncore configurations:
Parameter          Low (“0”)          High (“1”)
L2 size/latency    256 kB / 6 cyc.    1 MB / 8 cyc.
LLC size/latency   2 MB / 8 cyc.      16 MB / 24 cyc.
FSB width          2 bytes            8 bytes
Common uncore parameters:
DL1 write buffer   8 entries
L2                 64-byte line, 8-way, LRU, write-back, 8-entry write buffer, 16 MSHRs, IP-based stride + next-line prefetchers
LLC                64-byte line, 16-way, LRU, write-back, 8-entry write buffer, 16 MSHRs, IP-based stride + stream prefetchers
FSB clock          800 MHz
DRAM latency       200 cycles
Core configurations:
Core type              Small         Medium         Big
Decode/issue/commit    3/4/3         3/5/3          4/6/4
RS/LDQ/STQ/ROB         12/12/8/32    18/18/12/64    36/36/24/128
Common core parameters:
Clock              3 GHz
IL1 cache          2 cycles, 32 kB, 4-way, 64-byte line, LRU, next-line prefetcher
ITLB               2 cycles, 128-entry, 4-way, LRU, 4 kB page
DL1 cache          2 cycles, 32 kB, 8-way, 64-byte line, LRU, write-back, IP-based stride + next-line prefetchers
DTLB               2 cycles, 512-entry, 4-way, LRU, 4 kB page
Branch predictor   TAGE 4 kB, BTAC 7.5 kB, indirect branch predictor 2 kB, RAS 16 entries
Benchmarks: 22 SPEC2K6 benchmarks + 2 SPEC2K benchmarks (Vortex & Crafty).
Quantitative Accuracy – Big core
[Figure: CPI error (%) of PDCM, PDCM++ and BADCO with respect to Zesto for each SPEC2k6 benchmark on the big core. Vertical axis from -20 to 30 %.]
Quantitative Accuracy – Summary
[Figure: average CPI error (%) of PDCM, PDCM++ and BADCO for the small, medium and big cores. Vertical axis from 0 to 9 %.]
Relative Accuracy
Design Space Exploration.
Speedup more relevant than absolute performance.
We want the speedup error to be as small as possible.
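To make the distinction between absolute and relative accuracy concrete, here is a minimal sketch. The definitions are assumptions, since the deck does not spell them out: CPI error is taken as the relative difference between the model's CPI and the detailed simulator's CPI, speedup as the CPI on the reference uncore divided by the CPI on the evaluated uncore, and speedup error as the relative difference between the two speedups. All CPI values are hypothetical.

```python
def cpi_error(cpi_model, cpi_detailed):
    """Relative CPI error of the model, in percent (assumed definition)."""
    return 100.0 * (cpi_model - cpi_detailed) / cpi_detailed


def speedup(cpi_ref_config, cpi_config):
    """Speedup of an uncore configuration over the reference configuration."""
    return cpi_ref_config / cpi_config


def speedup_error(model_ref, model_cfg, detailed_ref, detailed_cfg):
    """Relative speedup error of the model, in percent (assumed definition)."""
    s_model = speedup(model_ref, model_cfg)
    s_detailed = speedup(detailed_ref, detailed_cfg)
    return 100.0 * (s_model - s_detailed) / s_detailed


# Hypothetical CPIs: the model overestimates CPI by 10 % on both configurations.
detailed = {"ref": 2.0, "cfg": 1.6}
model = {"ref": 2.2, "cfg": 1.76}

abs_err = cpi_error(model["cfg"], detailed["cfg"])
rel_err = speedup_error(model["ref"], model["cfg"], detailed["ref"], detailed["cfg"])
print(f"CPI error: {abs_err:.1f} %, speedup error: {rel_err:.1f} %")
```

In this made-up case a constant 10 % CPI bias cancels out in the speedup, which is why a model can be useful for design space exploration even when its absolute error is not negligible.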
Relative Accuracy
Config: 256 KB L2, 16 MB LLC and 2-byte bus.
Ref. config.: 256 KB L2, 2 MB LLC and 8-byte bus.
[Figure: speedup error (%) of PDCM, PDCM++ and BADCO for each SPEC2k6 benchmark. Vertical axis from -10 to 20 %.]
Relative Accuracy - Summary
[Figure: average speedup error (%) of PDCM, PDCM++ and BADCO for the uncore configurations (L2 size/LLC size/FSB width) 256KB/2MB/2B, 256KB/16MB/2B, 256KB/16MB/8B, 1MB/16MB/2B and 1MB/16MB/8B. Vertical axis from 0 to 4 %.]
Ref. config.: 256 KB L2, 2 MB LLC and 8-byte bus.
Simulation Speed
Simulation speed in MIPS (speedup over Zesto in parentheses):
               Zesto   PDCM++         BADCO
Zesto uncore   0.17    2.91 (17x)     2.52 (15x)
Core alone     0.19    13.04 (68x)    8.82 (47x)
Multicore simulation
Estimated CPI vs. measured CPI
Simulation speed
Simulation speed in MIPS:
Cores   Zesto   BADCO   Speedup
1       0.17    2.52    14.8x
2       0.096   2.41    25.2x
4       0.049   1.89    38.9x
8       0.017   1.19    68.1x
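The speedups can be rechecked from the MIPS figures with a short calculation. This sketch only divides the rounded values printed on the slide, so the ratios differ slightly from the printed speedups, presumably because those were computed from unrounded measurements (an assumption).

```python
# MIPS values as printed on the slide (rounded).
zesto_mips = {1: 0.17, 2: 0.096, 4: 0.049, 8: 0.017}
badco_mips = {1: 2.52, 2: 2.41, 4: 1.89, 8: 1.19}

# Speedup of BADCO over Zesto for each core count.
speedups = {cores: badco_mips[cores] / zesto_mips[cores] for cores in zesto_mips}

for cores, ratio in speedups.items():
    print(f"{cores} core(s): {ratio:.1f}x")
```

The trend is the point: Zesto slows down by roughly a factor of ten from 1 to 8 cores, while BADCO slows down by about a factor of two, so the speedup grows with the number of cores.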
Behavioral core modeling summary
• Behavioral core models increase simulation speed by one to two orders of magnitude with respect to detailed simulation.
• PDCM has limitations. We introduce BADCO, a new behavioral core model.
• BADCO models are built from two cycle-accurate simulations.
• BADCO is more accurate than PDCM and PDCM++.
Contributions II
1. BADCO modeling technique for approximate simulation of modern superscalar cores.
2. Workload stratification methodology for selecting small and representative multiprogram workloads.
Workload design
Select from the workload space a set of representative workloads.
Single-core
Workload = 1 benchmark.
Well-established methods (benchmark design).
Multi-core
Workload = combination of benchmarks.
No standard method for workload selection.
Multiprogram workload selection
The number W of possible multiprogram workloads, with B the number of benchmarks and K the number of cores:
$$W = \binom{B+K-1}{K} = \frac{(B+K-1)!}{K!\,(B-1)!}$$
For 29 SPEC-CPU benchmarks:
Cores   Workloads
2       4.35E+02
4       3.60E+04
8       3.03E+07
16      4.17E+11
Impossible to simulate all possible benchmark combinations
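The workload counts quoted for 29 benchmarks are consistent with counting multisets: an unordered choice of K benchmarks out of B, with repetition allowed, gives C(B+K-1, K) workloads. The sketch below assumes that counting rule (it reproduces the four values on the slide) and uses Python's `math.comb`; the function name `num_workloads` is hypothetical.

```python
import math


def num_workloads(num_benchmarks, num_cores):
    """Number of multisets of size num_cores drawn from num_benchmarks benchmarks."""
    return math.comb(num_benchmarks + num_cores - 1, num_cores)


counts = {k: num_workloads(29, k) for k in (2, 4, 8, 16)}
for k, w in counts.items():
    print(f"{k:2d} cores: {w:.2e} workloads")
```

Even at one second per simulated workload, the 16-core population would take on the order of ten thousand years to cover, which is why a sample has to be selected.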
Current practices I
Survey 2007 – 2012 (ISCA, MICRO and HPCA)
75 papers
9/75 random sampling.
Arbitrary sample size.
66/75 class-based selection.
Benchmark classes selected manually.
Define workload types.
Diverse practices to select workloads.
Arbitrary sample size.
Current practices II
“Interesting Sample” High degree of subjectivity.
Sample may be interesting but it may not be representative of the population.
Be cautious when drawing general conclusions.
Representative sample?
Probability that a characteristic of the population is preserved in the sample, exactly or within a certain tolerance.
Example characteristics:
Global throughput.
Global speedup.
Global ranking of microarchitectures.
Which of two microarchitectures is better?
TARGET: define a question that you want to ask to the sample and then look for a way to answer that question.
Methodology
TARGET: small representative sample.
Correct ranking of two microarchitectures.
Case study: 5 shared-cache replacement policies (LRU, RANDOM, FIFO, DIP and DRRIP).
Use approximate simulation (BADCO):
All benchmark combinations (2 & 4 cores).
10000 workloads for 8 cores.
Study random sampling.
Analytical model to compute sample size.
Study alternative sampling methods.
Random Sampling
All workloads have the same probability to be selected.
Safe way to avoid biases if the sample is big enough.
Lends itself to analytical modeling.
Analytical Model
What we want from the random sample is to know whether or not a microarchitecture Y is better than X.
tY(w) and tX(w): per-workload throughput of Y and X.
TY and TX: average throughput of Y and X.
We define the following random variable:
d(w) = tY(w) - tX(w)
d(w) is the per-workload throughput difference.
D is the average throughput difference.
Analytical Model
By the central limit theorem, the sample average D can be approximated by a normal distribution.
The degree of confidence that Y is better than X is equal to the probability that D is greater than zero.
Assuming almost 100% confidence and after some math we have
Where W is the sample size and cv is the coefficient of variation of d(w).
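The reasoning can be sketched numerically. If D is the mean of W independent draws of d(w), with population mean μ > 0 and standard deviation σ, the normal approximation gives P(D > 0) = Φ(√W / cv) with cv = σ/μ, which equals ½·[1 + erf(√(W/2) / cv)]. Asking for √W / cv ≥ z gives W ≥ z²·cv². The choice z² = 8 below is an assumption, not a formula taken from the slide; it is picked because it gives a confidence of about 99.8 % and yields W = 800 for cv = 10, in line with the 800-workload figure quoted elsewhere in the deck. Function names are hypothetical.

```python
import math


def confidence(sample_size, cv):
    """P(D > 0) under the normal approximation, for cv = sigma / mu > 0."""
    return 0.5 * (1.0 + math.erf(math.sqrt(sample_size / 2.0) / cv))


def required_sample_size(cv, z_squared=8.0):
    """Smallest W with sqrt(W) / |cv| >= z; z_squared = 8 is an assumed target."""
    return math.ceil(z_squared * cv * cv)


for cv in (1.0, 2.0, 2.70, 10.0, 10.86):
    w = required_sample_size(cv)
    print(f"cv = {cv:5.2f}: W = {w:4d}, confidence = {confidence(w, cv):.4f}")
```

The required sample grows with the square of cv: two microarchitectures whose throughput differences vary widely relative to their mean need far more workloads to be ranked reliably.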
Coefficient of variation
The coefficient of variation (CV) is a normalized measure of dispersion of a probability distribution.
[Figure: normal distributions with μ=0.5 and σ=0.5, σ=1 and σ=5 (CV=1, CV=2 and CV=10), plotted from -10 to 10.]
Estimate CV = compute sample size.
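The three curves in the figure are consistent with CV = σ/μ (0.5/0.5 = 1, 1/0.5 = 2, 5/0.5 = 10). A minimal sketch of estimating CV from a pilot sample of per-workload differences d(w) follows; the values in `pilot` are hypothetical, and the sign of CV is kept because it indicates which microarchitecture comes out ahead.

```python
import statistics


def coefficient_of_variation(diffs):
    """Sample estimate of CV = sigma / mu for the differences d(w)."""
    mean = statistics.mean(diffs)
    if mean == 0:
        raise ValueError("CV is undefined when the mean difference is zero")
    return statistics.stdev(diffs) / mean


# Legend of the figure: sigma / mu for the three curves.
legend_cv = [sigma / 0.5 for sigma in (0.5, 1.0, 5.0)]

# Hypothetical pilot sample of d(w) = tY(w) - tX(w).
pilot = [0.02, -0.01, 0.03, 0.00, 0.01]
cv = coefficient_of_variation(pilot)
print(legend_cv, round(cv, 3))
```

A pilot estimate like this one is noisy when the sample is small, so the sample may need to grow before the estimate of CV settles.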
CV Estimation: 4 cores – WSU
[Figure: 1/CV (from -1.5 to 1.5) for each pair of replacement policies (LRU vs. RND, FIFO, DIP and DRRIP; RND vs. FIFO, DIP and DRRIP; FIFO vs. DIP and DRRIP; DIP vs. DRRIP), estimated from a Zesto sample, a BADCO sample and the BADCO population.]
Random sampling model validation
[Figure: experimental confidence vs. model confidence that “DRRIP outperforms DIP” using WSU.]
Can we do better?
Explore alternative sampling techniques:
Balanced random sampling.
Stratified sampling: by benchmark classes or by per-workload throughput.
[Figure: required sample size (log scale, 1 to 10000) for 4 cores, for each pair of replacement policies and each throughput metric (IPCT, WSU, HSU). Some pairs need big samples.]
Balanced Random Sampling
Each benchmark occurs the same number of times in the whole workload population.
Balanced random sampling: each benchmark occurs the same number of times in the sample.
Probability of selecting a workload depends on the previous workloads selected.
No mathematical model.
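One simple way to obtain a balanced sample is to shuffle a pool that contains every benchmark the same number of times and cut it into workloads. This is a sketch of that idea under stated assumptions, not necessarily the procedure used in the study: the function name is hypothetical, a benchmark may appear more than once in a workload, and the sample size is required to make the pool divide evenly.

```python
import random
from collections import Counter


def balanced_random_sample(benchmarks, cores, num_workloads, seed=None):
    """Draw workloads so that every benchmark occurs equally often overall."""
    slots = cores * num_workloads
    if slots % len(benchmarks) != 0:
        raise ValueError("cores * num_workloads must be a multiple of the benchmark count")
    rng = random.Random(seed)
    pool = list(benchmarks) * (slots // len(benchmarks))
    rng.shuffle(pool)
    # Each consecutive group of `cores` benchmarks forms one workload.
    return [tuple(sorted(pool[i:i + cores])) for i in range(0, slots, cores)]


benchmarks = ["mcf", "gcc", "milc", "namd", "astar", "soplex"]
sample = balanced_random_sample(benchmarks, cores=4, num_workloads=6, seed=1)
occurrences = Counter(b for workload in sample for b in workload)
print(sample)
print(occurrences)
```

Because the pool is consumed as workloads are formed, the probability of picking a workload depends on the ones already picked, which is what makes this scheme hard to model analytically.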
Stratified Random Sampling
Classical sampling method.
Exploit homogeneities.
Divide the population into non-overlapping subsets (strata).
Take random samples in each stratum.
Sample throughput is a weighted average.
We study 2 variants:
Benchmark stratification.
Workload stratification.
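The weighted average can be written down directly: each stratum contributes the mean of its random sample, weighted by the stratum's share of the population. The sketch below is the textbook stratified estimator under the assumption that every stratum is non-empty; the function and parameter names are hypothetical, and `value` stands for whatever per-workload quantity is measured (for instance a throughput difference).

```python
import random
import statistics


def stratified_mean(strata, per_stratum, value, seed=None):
    """Estimate the population mean of value(w) from random samples per stratum."""
    rng = random.Random(seed)
    population = sum(len(stratum) for stratum in strata)
    estimate = 0.0
    for stratum in strata:
        sample = rng.sample(stratum, min(per_stratum, len(stratum)))
        weight = len(stratum) / population        # share of the population
        estimate += weight * statistics.mean(value(w) for w in sample)
    return estimate


# Toy population: two perfectly homogeneous strata of different sizes.
strata = [[1.0] * 10, [3.0] * 30]
estimate = stratified_mean(strata, per_stratum=2, value=lambda w: w, seed=0)
print(estimate)
```

With homogeneous strata, two workloads per stratum already recover the exact population mean (2.5 here), which is the sense in which stratification exploits homogeneity.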
Benchmark stratification
Attempt to formalize common practices (class-based selection).
Divide benchmarks into classes: group benchmarks with similar behavior.
Build strata for every combination of classes.
For example: three classes in a 4-core machine generate 15 strata (004, 013, 022, 112, …).
Class            Benchmarks
Low misses       povray, gromacs, milc, calculix, namd, dealII, perlbench, gobmk, h264ref, hmmer, sjeng
Medium misses    bzip2, gcc, astar, zeusmp, cactusADM
High misses      libquantum, omnetpp, leslie3d, bwaves, mcf, soplex
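The 15 strata can be enumerated mechanically. The sketch assumes that a label such as 013 gives the number of cores running a benchmark from each class (0 from the first class, 1 from the second, 3 from the third); the function name `class_strata` is hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def class_strata(num_classes, cores):
    """List every way to split `cores` among `num_classes` benchmark classes."""
    strata = []
    for combo in combinations_with_replacement(range(num_classes), cores):
        counts = Counter(combo)
        strata.append(tuple(counts.get(c, 0) for c in range(num_classes)))
    return strata


strata = class_strata(num_classes=3, cores=4)
labels = sorted("".join(str(n) for n in s) for s in strata)
print(len(strata), labels)
```

The count follows the same multiset rule as the workload population: C(3+4-1, 4) = 15.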
Workload stratification
Estimate per-workload throughput:
Approximate simulator (BADCO).
Large sample (>800 workloads).
Measure per-workload throughput difference d(w).
Sort the workloads according to d(w).
Use a clustering algorithm to group workloads into strata.
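The sort-then-cluster steps can be sketched as follows. The deck does not say which clustering algorithm is used, so the rule below is a stand-in chosen for brevity: workloads are sorted by d(w) and a new stratum starts whenever d(w) exceeds the first value of the current stratum by more than `max_spread`. The function name, the threshold and the d(w) values are all hypothetical.

```python
def stratify_by_difference(diffs, max_spread):
    """Group workloads with similar d(w) into contiguous strata.

    diffs: mapping from workload name to its throughput difference d(w).
    max_spread: largest allowed d(w) range inside one stratum (assumed rule).
    """
    ordered = sorted(diffs, key=diffs.get)       # workloads sorted by d(w)
    strata = []
    for workload in ordered:
        if strata and diffs[workload] - diffs[strata[-1][0]] <= max_spread:
            strata[-1].append(workload)          # close enough: same stratum
        else:
            strata.append([workload])            # too far: open a new stratum
    return strata


# Hypothetical d(w) values estimated with an approximate simulator.
diffs = {"w1": -0.20, "w2": -0.18, "w3": 0.01, "w4": 0.02, "w5": 0.03, "w6": 0.40}
strata = stratify_by_difference(diffs, max_spread=0.05)
print(strata)
```

Strata built this way are homogeneous in d(w) by construction, so a few detailed simulations per stratum can stand in for many workloads, provided the approximate simulator orders the workloads well.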
Alternative Sampling Methods
DIP > LRU, IPCT, 4 cores
[Figure: confidence (0.5 to 1) versus sample size (10 to 800 workloads) for random, balanced random, benchmark-strata and workload-strata sampling. CV = 10.86.]
Alternative Sampling Methods
DRRIP > LRU, IPCT, 4 cores
[Figure: confidence (0.6 to 1) versus sample size (10 to 160 workloads) for random, balanced random, benchmark-strata and workload-strata sampling. CV = 2.70.]
Practical guidelines
Method intended for incremental modification of a microarchitecture.
Estimate Cv:
From the sample; increase the sample if necessary.
Use an approximate simulator.
If:
Cv > 10: same average performance.
Cv in [2,10]: workload stratification.
Cv < 2: random sampling.
If workload stratification:
Sample greater than 800 workloads.
Approximate simulator with good relative accuracy.
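The decision rule reads naturally as a small function. This is only a restatement of the guideline, with one assumption made explicit: the thresholds are applied to the magnitude of Cv, since its sign only tells which of the two microarchitectures is ahead. The function name is hypothetical.

```python
def choose_sampling_method(cv):
    """Map an estimated coefficient of variation to the guideline's advice."""
    magnitude = abs(cv)   # assumption: the sign only encodes which design wins
    if magnitude > 10:
        return "same average performance"
    if magnitude >= 2:
        return "workload stratification"
    return "random sampling"


for cv in (10.86, 2.70, 1.0):
    print(cv, "->", choose_sampling_method(cv))
```

For the two case studies shown earlier, Cv = 10.86 falls in the "same average performance" range and Cv = 2.70 in the workload-stratification range.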
Conclusions I
• Improve the simulation speed of multicore processors by sacrificing some core accuracy.
• Behavioral core model: target the uncore.
• BADCO = Two traces + Infer dependencies.
• One to two orders of magnitude faster than Zesto.
• Average CPI error of less than 5% with respect to Zesto.
Conclusions II
• Current practices: non-representative samples.
• Interesting workloads ≠ representative workloads.
• First steps towards defining representative samples.
• It is important to define sample representativeness.
• Analytical model: correctly ranking two microarchitectures.
• Alternative sampling method: workload stratification. Requires approximate simulation.
Future work
• BADCO models for multi-thread programs?
• BADCO models for studying energy efficiency and heterogeneous architectures.
• Alternative ways of defining representativeness.
• Analytical models to compute sample size.
• Phase behavior of benchmarks.
• Clustering methods for workload stratification.
THANKS FOR YOUR ATTENTION!!!
QUESTIONS?