Post on 11-Apr-2018
Karu Sankaralingam University of Wisconsin-Madison
Collaborators: Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, and Doug Burger
The Dark Silicon Implications for Microprocessors
Multicore Decade?
We have relied on multicore scaling for over five years.
[Figure: timeline from 2000 to 2015: Pentium Extreme (dual-core), Core 2 (quad-core), i7 980x (hex-core), then "?"]
How much longer will it be our primary performance scaling technique?
Finding Optimal Multicore Designs
Comprehensive design space:
• Fixed area budget
• Fixed power budget
• Two sets of CMOS scaling projections
• Optimal core and diverse multicore organizations
• Parallel benchmarks
For the next five technology generations, we find the best-performing multicore through a comprehensive design-space search for each of the PARSEC benchmarks.
Symmetric Multicore Projections
Symmetric multicores alone will not sustain the multicore era.
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, Symmetric; annotations: 18x, and 3.4x in 10 years.]
Multicore Solutions
Asymmetric Topologies
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, Symmetric, Asymmetric; annotation: 3.5x.]
Dynamic Topologies
[Chakraborty (2008), Suleman et al. (2009)]
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, Symmetric, Asymmetric, Dynamic; annotation: 3.5x.]
Composed/Fused Topologies
[Ipek et al. (2007), Kim et al. (2007)]
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, Symmetric, Asymmetric, Dynamic, Composed; annotation: 3.7x.]
GPU-Style Cores
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, Symmetric, Asymmetric, Dynamic, Composed, GPU; annotation: 2.7x.]
Multicore Era Projections
The best designs speed up 14% per year rather than the recent trend of 34% per year.
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target and Composed (the Composed label appears twice); annotations: 3.7x, 18x.]
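As a quick check, the two growth rates quoted on this slide can be compounded over the plot's 10-year span. The sketch below assumes a constant annual rate; the function name is illustrative and not from the talk. It gives roughly 3.7x for 14% per year and a little under 19x for 34% per year, in line with the 3.7x and 18x annotations.

```python
def cumulative_speedup(annual_rate: float, years: int) -> float:
    """Compound a constant annual improvement rate over a number of years."""
    return (1.0 + annual_rate) ** years


# 14% per year: the best-case multicore designs from the projections.
best_case = cumulative_speedup(0.14, 10)
# 34% per year: the recent historical trend used as the target.
historical = cumulative_speedup(0.34, 10)

print(f"14% per year for 10 years: {best_case:.1f}x")
print(f"34% per year for 10 years: {historical:.1f}x")
```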
Why Diminishing Returns?
• Transistor area is still scaling
• Voltage and capacitance scaling have slowed
• Result: designs are power, not area, limited
Overview
Devices: find the best-case technology scaling
Cores: find the best cores
Multicores: find the best multicore organization
Projections: predict best-case multicore performance for each technology generation
Device Scaling Projections
From 45 nm to 8 nm:
             Conservative     Optimistic
             [Borkar 2007]    [ITRS 2010]
Area         32x              32x
Power        4.5x             8.3x
Frequency    1.3x             3.9x
Modeling Ideal Core Power/Perf.
[Figure: power (TDP, watts; 0 to 30) vs. SPECmark score (0 to 40) for Intel Nehalem, AMD Shanghai, Intel Core, and Intel Atom processors, with the Atom and Nehalem points labeled.]
Pareto Frontier includes all optimal power/performance points
Repeat using core area for optimal area/performance points
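A Pareto frontier over (performance, power) points keeps only the designs that no other design beats on both axes. The sketch below is a minimal illustration under assumptions: the sample points are made up rather than measured processor data, and it returns the discrete non-dominated set, not a fitted curve. Whether the talk fits a curve through these points is not stated on the slide.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (performance score, power in watts)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point.

    A point is dominated if another point has performance at least as
    high and power at least as low, and differs in at least one of them.
    """
    frontier: List[Point] = []
    # Visit points from highest to lowest performance; among equal
    # performance, visit the lowest power first.
    for perf, power in sorted(points, key=lambda p: (-p[0], p[1])):
        # Keep a point only if it uses strictly less power than every
        # higher-performing point kept so far.
        if not frontier or power < frontier[-1][1]:
            frontier.append((perf, power))
    return sorted(frontier)


# Hypothetical (score, watts) pairs, for illustration only.
samples = [(5, 2), (10, 6), (12, 5), (20, 15), (18, 20), (30, 25)]
print(pareto_frontier(samples))
```

The same function applies to the area/performance case by passing (score, area) pairs instead.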
Combining Device and Core Models
[Figure: device scaling moves the 45 nm frontier to the 32 nm frontier.]
What belongs in the multicore model?
[Diagram labels: Styles; Applications; Topologies; Area & Power / Performance Tradeoffs; Architectures; PARSEC fparallel, Data Use; Number of Threads, Cache Sizes; Area & Power Budget; Cache & memory latencies, memory bandwidth; Pareto Frontiers]
Multicore Speedup Model
$$
\text{Multicore Speedup} = \frac{1}{\dfrac{1 - f_{parallel}}{\text{Serial Speedup}} + \dfrac{f_{parallel}}{\text{Parallel Speedup}}}
$$
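The speedup model on this slide is an Amdahl's-law form: the serial fraction of the work runs at the serial speedup and the parallel fraction runs at the parallel speedup. The sketch below is a direct transcription of that reading. The function and parameter names are illustrative, and the example inputs are hypothetical.

```python
def multicore_speedup(f_parallel: float,
                      serial_speedup: float,
                      parallel_speedup: float) -> float:
    """Speedup = 1 / ((1 - f) / S_serial + f / S_parallel)."""
    if not 0.0 <= f_parallel <= 1.0:
        raise ValueError("f_parallel must be between 0 and 1")
    serial_time = (1.0 - f_parallel) / serial_speedup
    parallel_time = f_parallel / parallel_speedup
    return 1.0 / (serial_time + parallel_time)


# Hypothetical inputs: 99% parallel code, no serial speedup,
# and a 64x speedup on the parallel portion.
print(f"{multicore_speedup(0.99, 1.0, 64.0):.1f}x")
```

Even with 99% of the work parallelized, the serial 1% keeps the result well below 64x in this example.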
Multicore Performance Model
Performance is limited by both:
Computation: Ncores × (core frequency / CPIexe) × core utilization
Memory bandwidth: BWmax / (bytes from memory per instruction)
[Guz et al., 2009]
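One way to read the two limits is that delivered performance is the smaller of the compute bound and the memory-bandwidth bound. The sketch below assumes that reading. It takes the denominator of the memory term to be bytes fetched from memory per instruction, so that both bounds come out in instructions per second. All names and example values are illustrative.

```python
def multicore_performance(n_cores: int,
                          core_freq_hz: float,
                          cpi_exe: float,
                          core_utilization: float,
                          bw_max_bytes_per_s: float,
                          bytes_per_instruction: float) -> float:
    """Instructions per second, limited by computation and by bandwidth."""
    # Compute bound: cores x per-core issue rate x fraction of time busy.
    compute_bound = n_cores * (core_freq_hz / cpi_exe) * core_utilization
    # Bandwidth bound: how many instructions the memory system can feed.
    bandwidth_bound = bw_max_bytes_per_s / bytes_per_instruction
    return min(compute_bound, bandwidth_bound)


# Hypothetical chip: 8 cores at 2 GHz, CPI of 1, 50% utilization,
# 16 GB/s of memory bandwidth, 4 bytes from memory per instruction.
perf = multicore_performance(8, 2e9, 1.0, 0.5, 16e9, 4.0)
print(f"{perf:.2e} instructions per second")
```

In this made-up configuration the bandwidth bound is the tighter of the two, so adding cores would not help.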
Core Utilization Model
Core utilization (the fraction of time the core is ready to issue) is limited by:
(number of threads in core) / (number of threads needed to keep the core busy)
[Guz et al., 2009]
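The utilization limit can be sketched as the ratio of resident threads to the number needed to hide stalls. Capping the ratio at 1 is an assumption added here, since a core cannot be busy more than all of the time. The names are illustrative.

```python
def core_utilization(threads_in_core: float,
                     threads_to_keep_busy: float) -> float:
    """Fraction of time the core is ready to issue, capped at 1.0."""
    if threads_to_keep_busy <= 0:
        raise ValueError("threads_to_keep_busy must be positive")
    return min(1.0, threads_in_core / threads_to_keep_busy)


# Hypothetical cases: too few threads, then more than enough.
print(core_utilization(2, 8))
print(core_utilization(16, 8))
```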
Translating from SPECmark
1. From q, find core’s SPECmark speedup
2. Frequency linearly distributed from Atom to Nehalem
3. Recall: model predicts benchmark performance as f(benchmark characteristics, frequency, CPIexe)
4. Compute CPIexe such that Benchmark Speedup = SPECmark Speedup
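Step 2 can be sketched as a linear interpolation between the two endpoint designs. The sketch assumes frequency is interpolated against the core's SPECmark score q, which the slide does not state explicitly. The endpoint scores and frequencies in the example are placeholders, not values from the talk.

```python
def interpolate_frequency(q: float,
                          q_low: float, q_high: float,
                          freq_low: float, freq_high: float) -> float:
    """Linearly interpolate core frequency between two endpoint designs."""
    if not q_low <= q <= q_high:
        raise ValueError("q must lie between the two endpoint scores")
    # Position of q between the low-end and high-end design, from 0 to 1.
    t = (q - q_low) / (q_high - q_low)
    return freq_low + t * (freq_high - freq_low)


# Placeholder endpoints: (score 10, 1.6 GHz) and (score 30, 3.2 GHz).
print(f"{interpolate_frequency(20.0, 10.0, 30.0, 1.6, 3.2):.2f} GHz")
```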
Area and Power Constraints
Ncores × A(q) ≤ Area Budget
Ncores × P(q) ≤ Power Budget
Dark silicon = 1 − Ncores / (number of cores that fit in the chip area)
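The two constraints bound the number of cores that can be active at once, and the gap between that number and what the area alone would allow is the dark portion. The sketch below takes the dark fraction to be one minus the ratio of active cores to cores that fit, which is an interpretation of the slide. The names and example budgets are hypothetical.

```python
import math


def max_active_cores(core_area: float, core_power: float,
                     area_budget: float, power_budget: float) -> int:
    """Largest core count satisfying both the area and power budgets."""
    fit_by_area = math.floor(area_budget / core_area)
    fit_by_power = math.floor(power_budget / core_power)
    return min(fit_by_area, fit_by_power)


def dark_silicon_fraction(n_active: int, core_area: float,
                          area_budget: float) -> float:
    """Share of the cores that fit on the chip but cannot be powered."""
    cores_that_fit = math.floor(area_budget / core_area)
    return 1.0 - n_active / cores_that_fit


# Hypothetical core: 10 mm^2 and 5 W, on a 100 mm^2, 30 W chip.
n = max_active_cores(10.0, 5.0, 100.0, 30.0)
print(n, f"{dark_silicon_fraction(n, 10.0, 100.0):.0%}")
```

Here ten cores fit by area but only six can be powered, so the design is power limited.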
Dark Silicon
[Figure: percentage of dark silicon (0% to 100%) for blackscholes, bodytrack, canneal, ferret, streamcluster, and GM, under ITRS and Conservative scaling; two panels.]
Sources of Dark Silicon: Power + Limited Parallelism
             ITRS    Conservative
At 22 nm:    17%     26%
At 8 nm:     51%     71%
Overall Performance
[Figure: speedup (0 to 20) vs. year (0 to 10); curves: Target, ITRS: All Topologies, ITRS: Symmetric, Conservative: All Topologies, Conservative: Symmetric; annotations: fparallel = 0.99, 18x, 16x, 8x, 6x, 3x.]
Conclusions
• Multicore performance gains are limited
• Need at least 18%-40% per generation from architecture alone without additional power
[Figure: timeline: Unicore Era, Multicore Era, then "?"]