Complementing User-Level Coarse-Grain Parallelism
with Implicit Speculative Parallelism
Nikolas Ioannou, Marcelo Cintra
School of Informatics, University of Edinburgh
Intl. Symp. on Microarchitecture - December 2011
Introduction
Multi-cores and many-cores are here to stay
Parallel programming is essential to realize their potential
Focus on coarse-grain parallelism
Weak or no scaling of some parallel applications
Can we exploit under-utilized cores to complement coarse-grain parallelism?
– Nested parallelism in multi-threaded applications
– Exploit it using implicit speculative parallelism
Contributions
Evaluation of implicit speculative parallelism on top of explicit parallelism to improve scalability:
– Improves scalability by 40% on avg.
– Same energy consumption
Detailed analysis of multithreaded scalability:
– Performance bottlenecks
– Behavior on different input datasets
Auto-tuning to dynamically select the number of explicit and implicit threads
Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions
Bottlenecks: Large Critical Sections
[Figure: Integer Sort (IS), NASPB. Thread timeline for T0 to T3; speedup vs. number of cores; normalized execution time vs. number of cores (2 to 64), broken down into Busy, Lock, and Barrier.]
Bottlenecks: Load Imbalance
[Figure: RADIOSITY, SPLASH-2. Thread timeline for T0 to T3; speedup vs. number of cores; normalized execution time vs. number of cores (2 to 128), broken down into Busy, Lock, and Barrier.]
Can we use these cores to accelerate this application?
Proposal
Programming:
– Users explicitly parallelize code
– Trade off development time for performance gains
Architecture and Compiler:
– Exploit fine-grain parallelism on top of user threads
– Thread-Level Speculation (TLS) within each user thread
Hardware:
– Support both explicit and implicit threads simultaneously in a nested fashion
Proposal
#pragma omp parallel for
for(j = 0; j < M; ++j) {
  …
  for(i = 0; i < N; ++i) {
    … = A[L[i]] + …
    …
    A[K[i]] = …
  }
  …
}
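[Figure: explicit threads T0 … TK … TL … TM each execute iterations of the outer loop. Within TK, inner-loop iterations run as speculative threads TK,i, TK,i+1, TK,i+2, TK,i+3; likewise TL,i, TL,i+1, TL,i+2, TL,i+3 within TL.]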
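The inner loop above reads A[L[i]] and writes A[K[i]], so whether its iterations can safely overlap depends on the run-time contents of L and K. As a minimal sketch (not the TLS protocol used in the paper), the C function below counts how many iterations read an element that was written by one of the iterations that would be in flight alongside them. The function name, the window parameter, and the rule that every such read-after-write forces a restart are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model: up to `window` consecutive inner-loop iterations run
 * at the same time (window = 2 or 4 for 2-way or 4-way TLS). Iteration i
 * reads A[L[i]]; it is counted as a conflict if any of the window-1
 * iterations immediately before it writes the same element via A[K[p]].
 * Returns the number of conflicting iterations.
 */
size_t count_raw_conflicts(const int *L, const int *K, size_t n, size_t window)
{
    assert(window > 0);
    size_t conflicts = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Oldest predecessor that can still be in flight with iteration i. */
        size_t first = (i >= window - 1) ? i - (window - 1) : 0;
        for (size_t p = first; p < i; ++p) {
            if (K[p] == L[i]) {
                ++conflicts;   /* read-after-write across speculative threads */
                break;         /* count each iteration at most once */
            }
        }
    }
    return conflicts;
}
```

Under these assumptions, a count of zero suggests the loop is a good candidate for speculation, while a count close to N suggests most speculative threads would restart.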
Proposal: Many-core Architecture
Many-core partitioned in clusters (tiles)
Coherence (MESI):
– Snooping coherence within cluster
– Directory coherence across clusters
Support for TLS only within cluster:
– Snooping TLS protocol
– Speculative buffering in L1 data caches
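The cluster rules above can be sketched in a few lines of C: two cores use snooping coherence and may host speculative threads of the same explicit thread only if they sit in the same cluster. The linear core numbering (cores 0 to cluster_size-1 form cluster 0, and so on) and all function and type names are hypothetical; the slides do not specify how cores are numbered.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { COHERENCE_SNOOPING, COHERENCE_DIRECTORY } coherence_path;

/* Assumed numbering: consecutive core ids are grouped into clusters. */
int cluster_of(int core, int cluster_size)
{
    assert(core >= 0 && cluster_size > 0);
    return core / cluster_size;
}

/* TLS is supported only within a cluster. */
bool can_speculate_together(int core_a, int core_b, int cluster_size)
{
    return cluster_of(core_a, cluster_size) == cluster_of(core_b, cluster_size);
}

/* Snooping inside a cluster, directory across clusters. */
coherence_path coherence_between(int core_a, int core_b, int cluster_size)
{
    return can_speculate_together(core_a, core_b, cluster_size)
               ? COHERENCE_SNOOPING
               : COHERENCE_DIRECTORY;
}
```

With 4-way clusters, for example, cores 4 to 7 could run the speculative threads of one explicit thread, while cores 3 and 4 could not share one.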
Proposal: Many-core Architecture
[Figure: many-core layout. Tiles T0 to T31 are arranged in a grid with memory controllers (Mem. Contr.) attached; each tile holds cores C0 to C3, each with an instruction cache (IC) and a data cache (DC), plus an L2 cache and a directory/router.]
Complementing Coarse-Grain Parallelism
[Figure: thread timelines comparing a 4-thread run (T0 to T3) with two ways of using 8 cores (T0 to T7): doubling the explicit threads (2x Explicit Threads), or keeping 4 explicit threads and adding 4 implicit speculative threads (4 ETs + 4 ISTs).]
Expected Speedup Behavior
[Figure: speedup vs. number of cores (1 to 64) for Baseline, 2-way TLS, and 4-way TLS, with marked points A, B, and C and three regions: baseline speedup region, 2-way TLS speedup region, and 4-way TLS speedup region.]
Proposal: Auto-Tuning the Thread Count
Find the scalability tipping point dynamically
Choose whether to employ implicit threads
Simple hill climbing approach
Applicable to OpenMP applications that are amenable to Dynamic Concurrency Throttling (DCT) [Curtis-Maury PACT’08]
Developed a prototype in the Omni OpenMP System
Auto-tuning example
…
#pragma omp parallel for
for(j = 0; j < M; ++j) {
  …
  for(i = 0; i < N; ++i) {
    … = A[L[i]] + …
    …
    A[K[i]] = …
  }
  …
}
…
Learning phase, each time omp parallel region i is detected:
First time:
– Can we compute the iteration count statically, and is it less than the max core count? Yes (M=32) -> set initial Tcount to 32
– Measure execution time t_i^1
Second time:
– Set Tcount to next value (16)
– Measure execution time t_i^2
– t_i^2 < t_i^1 → continue exploration
Third time:
– Set Tcount to next value (8)
– Measure execution time t_i^3
– t_i^3 > t_i^2 → stop exploration
Fourth time:
– Use Tcount = 16, no further exploration
– Set TLS to 4-way
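The exploration in this example can be sketched as a small hill climber in C. This is an illustration under stated assumptions, not the Omni OpenMP prototype: the halving step (32, 16, 8) is read off the example, the measure_fn hook stands in for timing one execution of the parallel region, demo_measure returns made-up timings, and choose_tls_ways assumes the spare cores per explicit thread decide the TLS degree (the slides give the outcome, 4-way, but not the rule).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook: run the region with `tcount` explicit threads and
 * return its execution time. */
typedef double (*measure_fn)(int tcount, void *ctx);

/* Hill climbing: start at `initial`, halve the thread count while the
 * measured time keeps improving, and return the best count seen. */
int tune_thread_count(int initial, measure_fn measure, void *ctx)
{
    assert(initial >= 1 && measure != NULL);
    int best = initial;
    double best_time = measure(best, ctx);
    for (int next = best / 2; next >= 1; next /= 2) {
        double t = measure(next, ctx);
        if (t >= best_time)
            break;              /* no improvement: stop exploration */
        best = next;
        best_time = t;
    }
    return best;
}

/* Assumed rule: give each explicit thread as many implicit threads as the
 * remaining cores allow, capped at 4-way. Returns 1 when TLS is not used. */
int choose_tls_ways(int total_cores, int tcount)
{
    assert(total_cores >= 1 && tcount >= 1);
    int cores_per_thread = total_cores / tcount;
    if (cores_per_thread >= 4) return 4;
    if (cores_per_thread >= 2) return 2;
    return 1;
}

/* Made-up timings that mirror the example: 16 threads is the fastest. */
double demo_measure(int tcount, void *ctx)
{
    (void)ctx;
    switch (tcount) {
    case 32: return 10.0;
    case 16: return 8.0;
    case 8:  return 9.0;
    default: return 20.0;
    }
}
```

Calling tune_thread_count(32, demo_measure, NULL) follows the same path as the example: it tries 32, 16, and 8, then settles on 16. On an assumed 64-core machine, choose_tls_ways(64, 16) then selects 4-way TLS.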
Evaluation Methodology
SESC simulator, extended to model our scheme
Architecture:
– Core:
   4-issue OoO superscalar, 96-entry ROB, 3GHz
   32KB, 4-way DL1 $; 32KB, 2-way IL1 $
   16Kbit hybrid branch predictor
– Tile/System:
   128 cores partitioned in 2-way or 4-way tiles (evaluate both)
   Shared L2 cache: 8MB, 8-way, 64 MSHRs
   Directory: full-bit vector sharer list
   Interconnect: grid, 64B links; 48GB/s to main memory
Evaluation Methodology
Benchmarks:
– 12 workloads from PARSEC 2.1, SPLASH2, NASPB
– Simulate parallel region to completion
Compilation:
– MIPS binaries generated using GCC 3.4.4
– Speculation added automatically through source-to-source compiler
– Selection of speculation regions through manual profiling
Power:
– CACTI 4.2 and Wattch
Evaluation Methodology
Alternative schemes compared against:
– Core Fusion [Ipek ISCA’07]:
   Dynamic combination of cores to deal with lowly-threaded apps
   Approximated through wide 8-issue cores with all the core resources doubled without latency increase => upper bound
– Frequency Boost:
   Inspired by Turbo Boost [Intel’08]
   For each idle core, one other core gains a frequency boost of 800MHz with a 200mV increase in voltage (same power cap)
All these schemes shift resources to a subset of cores in order to improve performance
Bottom Line
Speedup over best scalability point
[Figure: speedup per benchmark for TLS-2, TLS-4, CFusion, and FBoost.]
TLS-4: 41% avg
TLS-2: 27% avg
Energy
Showing best performing point for each scheme
[Figure: normalized energy per benchmark for TLS-2, TLS-4, CFusion, and FBoost.]
Energy consumption slightly lower on avg
Spending less time in busy synchronization
High misspeculation: higher energy
Little synchronization: higher energy
Serial/Critical Sections
[Figure: is (NASPB). Speedup vs. number of cores for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time vs. number of cores (2 to 128), broken down into Busy, Lock, and Barrier.]
Load Imbalance
[Figure: radiosity (SPLASH2). Speedup vs. number of cores for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time vs. number of cores (2 to 128), broken down into Busy, Lock, and Barrier.]
Synchronization Heavy
[Figure: ocean (SPLASH2). Speedup vs. number of cores for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time vs. number of cores (2 to 128), broken down into Busy, Lock, and Barrier.]
Coarse-Grain Partitioning
[Figure: swaptions (PARSEC). Speedup vs. number of cores for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time vs. number of cores (2 to 128), broken down into Busy, Lock, and Barrier.]
Poor Static Partitioning
[Figure: sp (NASPB). Speedup vs. number of cores for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time vs. number of cores (2 to 128), broken down into Busy, Lock, and Barrier.]
Effect of Dataset size
Unchanged behavior: cholesky
– Also: canneal, ocean, ft, is, sp
Improved scalability, but TLS boost remains: swaptions
– Also: bodytrack, radiosity, ep
Improved scalability, lessened TLS boost: streamcluster
Worse scalability, even better TLS boost: water
[Figures: speedup vs. number of cores for each case, with series base, baseL, TLS-2, TLS-2L, TLS-4, and TLS-4L.]
Conclusions
Multicores and many-cores are here to stay
– Parallel programming essential to exploit new hardware
– Some coarse-grain parallel programs do not scale
– Enough nested parallelism to improve scalability
Proposed speculative parallelization through implicit speculative threads on top of explicit threads:
– Significant scalability improvement of 40% on avg
– No increase in total energy consumption
– Presented an auto-tuning mechanism to dynamically choose the number of threads that performs within 6% of the oracle
Related Work
[von Praun PPoPP’07] Implicit ordered transactions
[Kim Micro’10] Speculative Parallel-stage Decoupled Software Pipelining
[Ooi ICS’01] Multiplex
[Madriles ISCA’09] Anaphase
[Rajwar MICRO’01], [Martinez ASPLOS’02] Speculative Lock Elision
[Moravan ASPLOS’06], etc., Nested transactional memory
Bibliography
[Intel’08] Intel Corp. Intel Turbo Boost Technology in Intel Core Microarchitecture (Nehalem) Based Processors. White Paper, 2008
[Ipek ISCA’07] Ipek et al. Core fusion: Accommodating software diversity in chip multiprocessors. ISCA 2007
[von Praun PPoPP’07] C. von Praun et al. Implicit parallelism with ordered transactions. PPoPP 2007
[Kim Micro’10] Scalable speculative parallelization in commodity clusters. MICRO 2010
[Ooi ICS’01] C.-L. Ooi et al. Multiplex: Unifying conventional and speculative thread-level parallelism on a chip multiprocessor. ICS 2001
[Madriles ISCA’09] C. Madriles et al. Boosting single-thread performance in multi-core systems through fine-grain multi-threading. ISCA 2009
[Rajwar MICRO’01] R. Rajwar and J.R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. MICRO 2001
[Martinez ASPLOS’02] J. Martinez and J. Torrellas. Speculative synchronization: Applying thread-level speculation to explicitly parallel applications. ASPLOS 2002
[Moravan ASPLOS’06] Supporting nested transactional memory in LogTM. ASPLOS 2006
[Curtis-Maury PACT’08] Prediction models for multi-dimensional power-performance optimization on many-cores. PACT 2008
Fetched Instructions
[Figure: normalized total fetched instructions per benchmark for TLS-2, TLS-4, FBoost, and CFusion.]
Failed Speculation
[Figure: normalized execution time (0% to 100%) per benchmark for TLS-2 and TLS-4, broken down into Restart and Busy.]
Serial/Critical Sections
[Figure: bodytrack (PARSEC). Speedup vs. number of cores for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time vs. number of cores (2 to 128), broken down into Busy, Lock, and Barrier.]
Auto-tuning
OpenMP apps
Performs within 6% of static oracle
[Figure: speedup for ep, ft, is, and sp, comparing Static Oracle and Auto-tuning.]