Optimizing for MPs (Multiprocessors)
Erik Hagersten, Uppsala University, Sweden
Dept of Information Technology| www.it.uu.se © Erik Hagersten| user.it.uu.se/~ehOPT 2
AVDARK2012
Cache Waste

/* Unoptimized */
for (s = 0; s < ITERATIONS; s++) {
    for (j = 0; j < HUGE; j++)
        x[j] = x[j+1];              /* will hog the cache but not benefit */
    for (i = 0; i < SMALLER_THAN_CACHE; i++)
        y[i] = y[i+1];              /* will be evicted between usages */
}

/* Optimized */
for (s = 0; s < ITERATIONS; s++) {
    for (j = 0; j < HUGE; j++) {
        PREFETCH_NTA x[j+1];        /* will be installed in L1, but not L3 (AMD) */
        x[j] = x[j+1];
    }
    for (i = 0; i < SMALLER_THAN_CACHE; i++)
        y[i] = y[i+1];              /* will always hit in the cache */
}
Also important for single-threaded applications if they
are co-scheduled and share cache with other applications.
UART Research: Hints to avoid cache pollution (non-temporal prefetches)

[Graph: cache misses vs. cache size -- the larger the cache, the better. With the "Hint: don't allocate!" a cache of size actual/4 reaches the miss rate of the full (actual) size; without the hint, actual/4 gives 2x the miss rate. Original lim = 1.7 MB, hint: lim = actual/4.]

[Bar chart: throughput of the original vs. hinted version, for one instance and four instances; with four instances the hinted version is 40% faster.]
Categorizing and avoiding cache waste

[Diagram: miss rate vs. cache size ($-size) per application. Where a larger cache gives no Δ benefit there is no point in caching -- use per-instruction cache avoidance (prefetch.nta) for such "hogging" accesses.]

[Quadrant chart classifying applications by how much they hog the cache vs. how much Δ benefit they get from it: "Don't care", "Slows others", "Slowed by others", "Slows & slowed"; example placements include bzip, LBM and LQ (libquantum).]

Automatic "taming" of the hoggers -- application classification:

[Bar chart, AMD Opteron: normalized performance (0-1.2) of bzip2, libquantum, LBM and their geometric mean, run individually, in a mix, and in a mix with the hogging accesses patched; patching recovers 25%.]

Andreas Sandberg, David Eklov and Erik Hagersten. Reducing Cache Pollution Through Detection and Elimination of Non-Temporal Memory Accesses. In Proceedings of Supercomputing (SC), New Orleans, LA, USA, November 2010.
Coherence traffic

ORIG:

Thread 0:
    int a, total;
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        a++;                  /* both threads write the same variable: */
    }                         /* the cacheline ping-pongs between them */
    join()
    total = a;

Child:
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        a++;
    }

OPT:

Thread 0:
    int a, total;
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        a++;
    }
    join()
    total += a;

Child:
    int b;                    /* private counter: no coherence traffic */
    for (int i = 0; i < HUGE; i++) {
        /* do some work */
        b++;
    }
    total += b;
False sharing

ORIG:

Thread 0:
    int a, b;                 /* a and b end up on the same cacheline */
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        ...
        a++;                  /* falsely shared with the child's b++ */
    }
    join()
    total = a + b;

Child:
    for (int i = 0; i < HUGE; i++) {
        ...
        b++;
    }

OPT:

Thread 0:
    int a;
    spawn_child()
    for (int i = 0; i < HUGE; i++) {
        ...
        a++;
    }
    join()
    total += a;

Child:
    int b;                    /* lives in the child's own storage */
    for (int i = 0; i < HUGE; i++) {
        ...
        b++;
    }
    total += b;
Coherence Utilization

struct vec_type {
    int a; int b; int c; int d; int e; int f;
};

ORIG:

Thread 0:
    vec_type x[HUGE];
    for (int i = 0; i < HUGE; i++) {
        ...
        x[i].a++;
    }
    spawn_child()
    ...
    join()

Child (Thread 1):
    for (int i = 0; i < HUGE; i++) {
        y[i] = x[i].a;
    }

[Cacheline layout: x[0] = |a b c d e f|, x[1] = |a b c d e f|, ... -- only the a field of each transferred cacheline is used, so coherence utilization is 1/6.]
A Bad Example: "POUNDING"

proc lock(lock_variable) {
    while (TAS[lock_variable] == 1) {}   /* bang on the lock until free */
}

proc unlock(lock_variable) {
    lock_variable := 0
}

Assume: the function TAS (test-and-set) returns the current memory value and atomically writes the busy pattern "1" to the memory location.

Generates too much traffic!! -- spinning threads produce traffic!
Optimistic Test&Set Lock ("spinlock")

proc lock(lock_variable) {
    while true {
        if (TAS[lock_variable] == 0) break;   /* bang on the lock once, done if TAS == 0 */
        while (lock_variable != 0) {}         /* spin locally in your cache until "0" observed */
    }
}

proc unlock(lock_variable) {
    lock_variable := 0
}

Much less coherence traffic!! -- still lots of traffic at lock handover!
Uppsala Programming for Multicore Architecture Center (UPMARC)

62 MSEK grant / 10 years [$9M/10y] + related additional grants at UU = 130 MSEK

Research areas: performance modeling, new parallel algorithms, scheduling of threads and resources, testing & verification, language technology, multicore in wireless and sensors
StatCache: Insight and Efficiency
Slowdown ~10% (for long-running applications)

[Diagram: Online sampling on the host computer -- a sparse sampler randomly selects accesses to monitor from the address stream (1: read A, 2: read B, 3: read C, 4: write C, 5: read B, 6: read D, 7: read A, 8: read E, 9: read B), producing an application fingerprint of reuse distances (e.g. 5, 3, ...). Offline "insight technology" -- a probabilistic cache model combines the fingerprint with the architectural parameters of the target architecture (cores, L1/L2, memory) into modeled behavior, which ThreadSpotter turns into advice.]
UART: Efficient sparse sampling

1. Use HW-counter overflow to randomly select accesses to sample (e.g. on average every 1,000,000th access)
2. Set a watchpoint for the data cacheline they touch
3. Use HW counters to count #memory accesses until the watchpoint traps

[Figure: access stream A B D B E B A F D B ... with two sampled accesses and the traps that end their reuse windows.]

Sampling overhead ~17% (10% at Acumem for long-running apps)
(Modeling with math < 100 ms)
Fingerprint ≈ sparse reuse distance histogram

[Figure: histogram h(d) over reuse distance d.]
Modeling random caches with math (Assumption: "constant" miss ratio)

[Access stream A B D B E B A F D B ...: the sampled reuse pair A...A has reuse distance rd_i = 5, so the expected number of replacements in between is #repl ≈ 5 * MissRatio, and p_miss = m(#repl) by the miss equation m.]
The cacheline A is in a cache with L cachelines (assuming a fully associative cache):

After 1 replacement: (1 - 1/L) chance that A survives
After R replacements: (1 - 1/L)^R chance that A survives
Modeling random caches with math (Assumption: "constant" miss ratio)

Miss equation:  m(repl) = 1 - (1 - 1/L)^repl

For the sampled reuse with rd_i = 5:  #repl ≈ 5 * MissRatio, so p_miss = m(5 * MissRatio)
For a sampled reuse with rd_i = 3:    #repl ≈ 3 * MissRatio, so p_miss = m(3 * MissRatio)

n samples:  MissRatio * n = Σ_{i=0}^{n} m(rd(i) * MissRatio)

Can be solved in a "fraction of a second" for different L:s.
Accuracy: Simulation vs. "math" (random replacement)

[Figure: miss ratio (%) vs. cache size (bytes) for vpr, gzip and ammp, comparing simulation (~100x slowdown) with the math model ("fractions of a second").]
Modeling LRU Caches: Stack distance...

[Access stream with a sampled reuse pair A-A, rd_i = 5, spanning positions Start = 2 to End = 6.]

Stack distance: how many unique data objects are touched between the two accesses? Answer: 3

If we know all reuses: how many of the reuses starting in 2-6 go beyond End? Answer: 3

Stack_distance = Σ_{k=Start}^{End} [d(k) > (End - k + 2)]

For each sample: if (Stack_distance > L) miss++ else hit++
But we only know a few reuse distances...

Estimate: how many of the reuses 2-6 go beyond End? Answer: Est_SD

Est_SD = Σ_{k=Start}^{End} p[d(k) > (End - k)]

Assume that the distribution (aka the histogram h(d)) of sampled reuses is representative for all accesses in that "time window".
All SPEC 2006
Modeling coherence

Record coherence-related interaction at runtime (architecture-independent)
Model coherence effects off-line
Can model different topologies and thread bindings off-line

[Figure: sampled access streams of Thread A (rA B rD B E B rA F rD B ...) and Thread B (B E B wA F rD B ...), with watchpoint traps capturing the interleaving of the reads and the write to A.]
3: Our Approach (needs to be efficient)

1. Capture machine-independent runtime information (data locality):
   MissRatio * n = Σ_i (1 - (1 - 1/L)^(rd(i) * MissRatio))
2. Measure impact of resource allocations: gather runtime info, solve equations, add heuristics
3. Capture code usage information? Clustering, K-means...
4. Capture power properties?

Efficient modeling -- predict (for many options): cache statistics, bandwidth requirement, performance, power consumption, phase behavior...

Draw conclusions, build tools -- find the "best": core type, cache size, thread scheduling, frequency, code optimizations...
Achievements -- the world's "best":

1. Cache locality samplers & cache "simulator" (OH ~20%): cache hit-rate model for data and instructions (~10 ms); multi-threading model [a.k.a. coherence model] (~10 ms); cache sharing model (~10 ms)
2. Cache/BW quantitative measurements (OH ~5%): performance prediction & BW requirement (~10 ms); cache sharing model (CPI & BW) (~10 ms)
3. DVFS models & run-time (power) management (performance: 98%, energy: 50%)
4. On-line phase detection tool (OH ~2%): phase-guided sampling; phase-guided power management
5. Simplest coherence protocol, VIPS: two states, self-invalidation, no directory

[Figures: modelled vs. simulated cache allocation on multicore [MB]; misses vs. $ size; CPI and BW vs. cache size; performance and bandwidth over time on real HW; phase behavior over time.]
Multi-threaded Case Study:Gauss-Seidel on Multicores
From Wallin et al, ICS 2006
Criteria for HPC Algorithms

Past: minimize communication; maximize scalability (1000s of CPUs)

Optimizing for a multicore chip: on-chip communication is "for free"; scalability is limited to ~10 threads; the caches are tiny; memory bandwidth is the bottleneck

Data locality is key!
Selected HPCwire Articles

"More Than 16 Cores May Well Be Pointless" -- Sandia Labs, Dec 7, 2008
"Up Against the Memory Wall: Never mind the cores. Just hand over the cache" -- Michael Feldman, Dec 11, 2008
"HPC@Intel: When to Say No to Parallelism" -- Sanjiv Shah, Intel, Jan 14, 2009
"Finding a Door in the Memory Wall" -- Erik Hagersten, Acumem, Feb-April 2009
Example: Gauss-Seidel

LOOP: UPDATE ALL POINTS
      IF (convergence_test) ...

[Grid figure: each point is updated from its neighbours' values; the iteration numbers 1 and 2 on the points show how the second sweep trails the first, creating a data dependence between sweeps.]

(Longer explanation: "Finding a Door in the Memory Wall" @ HPCwire)

Mission: "Maximize the parallelism and minimize the inter-thread communication"
State-of-the-art: Removing the Dependence: Red/Black

LOOP: UPDATE ALL RED POINTS
      UPDATE ALL BLACK POINTS
      IF (convergence_test) ...

[Grid figure: the points are colored in a red/black checkerboard; red points depend only on black neighbours and vice versa, so all points of one color can be updated independently.]
State-of-the-art: Red/Black, Parallelism = N²/2

LOOP: IN PARALLEL: UPDATE ALL RED POINTS
      IN PARALLEL: UPDATE ALL BLACK POINTS
      IF (convergence_test) ...

[Grid figure: the checkerboard split between Core 0 and Core 1.]

Limited communication, N²/2 parallelism -- done! Only one problem...
Only One Problem: Performance

[Figure: speedup vs. # cores (0-8) for Red/Black; the speedup saturates around 2x.]
Back to the drawing board: Temporal blocking for seq. code

LOOP: LOOP: UPDATE ALL POINTS IN ACTIVE REGION
            SLIDE DOWN THE REGION
      IF (convergence_test) ...

[Figure: a small active region slides down over the grid along the sweep path; legend: active region, current point, sweep path, data dependence, 1,2,3,4 = iteration number, cacheline layout.]

Communication is "for free" and moderate parallelism is OK. Priority 1: limit bandwidth needs!
Back to the drawing board: Temporal blocking for seq. code (cont.)

LOOP: LOOP: UPDATE ALL POINTS IN ACTIVE REGION
            SLIDE DOWN THE REGION
      IF (convergence_test) ...

[Figure: as the active region slides down, it carries rows from iterations 1-4, so each point receives 4 updates in one pass: 4 iterations in one sweep!]

Communication is "for free" and moderate parallelism is OK. Priority 1: limit bandwidth needs!
DRAM_traffic(cache_size)

[Figure: fetch rate -- the fraction of mem_ops generating DRAM traffic (0-3%) -- vs. cache size (256k-512M), for Red/Black and for temporal blocking with Block = 1, 2, 4, 8, 16.]
G-S, temporal blocking, Parallelism = N

[Figure: the sliding active region split column-wise across Core 0-3, with synchronization flags between neighbouring cores; legend: active region, current point, sweep path, data dependence, 1,2,3,4 = iteration number, cacheline layout, sync flag iteration no.]

Wait until your "lefty" is done: lots of communication
• Producer/consumer flag
• Sharing of data values
Only N-fold parallelism
Problems we ran into, 1 (2)

[Figure: a 512-element row = 64 cache lines, split into 16 cache lines per core across Core 0-3; each core's slice indexes into the same part of the L2 cache.]
Problems we ran into, 2 (2)

We had a loop-nesting problem that the compiler optimized away... sometimes.
Running on a Multisocket

[Figure: two sockets, each with its own DRAM, connected through interconnect interfaces (I/F); the "100" marks the inter-socket cost.]

Coherence = Non-Uniform Coherence
Example: G-S, temporal blocking with PADDING

[Figure: the same four-core sliding-region picture, now with PADDING inserted between the cores' slices so they no longer index into the same cache sets; legend: active region, current point, sweep path, data dependence, 1,2,3,4 = iteration number, cacheline layout, sync flag iteration no.]
Lessons Learned: Optimize cache usage BEFORE parallelizing

[Figure: performance vs. # cores (0-8) for Red/Black and Block=8; the temporally blocked version is 3x faster.]

[Wallin, Löf, Holmgren, Hagersten @ ICS 2006]
Demo Time!

G-S: DanW's code, optimized