UNIVERSITY OF MASSACHUSETTS, AMHERST • School of Computer Science
PREDATOR: Predictive False Sharing Detection
Tongping Liu*, Chen Tian, Ziang Hu, Emery Berger*
*University of Massachusetts Amherst; Huawei US Research Center
Parallelism: Expectation is Awesome
[Figure: runtime (s) versus number of threads (1, 2, 4, 8) for the parallel program below; series: Expectation]
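int count[8];
int W;  // counters per thread
void increment(int S) {
  for (int in = S; in < S + W; in++)
    for (int j = 0; j < 1000000; j++)
      count[in]++;
}
int main(int THREADS) {  // pseudocode: THREADS is 1, 2, 4, or 8
  W = 8 / THREADS;
  for (int i = 0; i < 8; i += W)
    spawn(increment, i);  // pseudocode: run increment(i) in a new thread
}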
Parallelism: Reality is Awful
[Figure: runtime (s) versus number of threads (1, 2, 4, 8) for the parallel program; series: Expectation, Reality]
False sharing slows the program by 13X
False sharing: each thread increments different elements of count, but adjacent elements of count share a cache line.
False Sharing in Real Applications
False sharing slows MySQL by 50%
False Sharing vs. True Sharing
[Figure: a cache line under two access patterns. False sharing: Task 1, Task 2, Task 3, Task 4. True sharing: Task 1, Task 2.]
Resource Contention at Cache Line Level
False Sharing Causes Performance Problems
[Figure: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; an "Invalidate" message passes between the two caches]
Cache line: basic unit of data transfer
Interleaved accesses cause cache invalidations
False Sharing is Everywhere
me = 1; you = 1;                  // globals
me = new Foo; you = new Bar;      // heap
class X { int me; int you; };     // fields
array[me] = 12; array[you] = 13;  // array indices
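All four cases come down to the same address arithmetic: two locations can falsely share when they fall in the same line-sized block of memory. The C sketch below illustrates that check. The names `cache_line_index`, `same_cache_line`, `fields_share_line`, and `struct x_fields`, and the 64-byte line size, are assumptions made for illustration and are not taken from PREDATOR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define LINE_SIZE 64u

/* Hypothetical pair of per-thread fields, mirroring "class X" above. */
struct x_fields {
    int me;   /* updated by one thread */
    int you;  /* updated by another thread */
};

/* Index of the cache line that holds addr. */
uintptr_t cache_line_index(uintptr_t addr, uintptr_t line_size) {
    assert(line_size > 0);
    return addr / line_size;
}

/* True when two addresses fall on the same cache line. */
bool same_cache_line(uintptr_t a, uintptr_t b, uintptr_t line_size) {
    return cache_line_index(a, line_size) == cache_line_index(b, line_size);
}

/* True when the two fields of this particular object share a line. */
bool fields_share_line(const struct x_fields *x) {
    return same_cache_line((uintptr_t)&x->me, (uintptr_t)&x->you, LINE_SIZE);
}
```

Because `me` and `you` are adjacent fields, `fields_share_line` returns true for almost every placement of the object; it returns false only when the object happens to straddle a line boundary.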
False Sharing is Hard to Diagnose
Multiple experts worked together to diagnose a MySQL scalability issue (1.5M LOC)
Problems of Existing Tools
• No precise information/false positives
  – WIBA’09, VEE’11, EuroSys’13, SC’13
• Accurate & precise
  – OOPSLA’11 (cannot detect read-write false sharing)
Shared problem: they only detect observed false sharing
Detect false sharing causing performance problems
[Figure: Task 1 on Core 1 and Task 2 on Core 2, each core with its own cache above main memory; an "Invalidate" message passes between the two caches]
False sharing causes performance problems: interleaved accesses → cache invalidations → performance problems
Find cache lines with many cache invalidations
Find Lines with Many Invalidations
[Figure: memory (globals and heap) divided into cache lines]
Track cache invalidations on each cache line
Track Cache Invalidations
• Hardware-based approach
  – Needs hardware support
  – No portability
• Simulation-based approach
  – Needs hardware information such as cache hierarchy and cache capacity
  – Very slow
• Conservative assumptions
  – Each thread runs on a different core with its own private cache
  – Infinite cache capacity
PREDATOR: based on the memory access history of each cache line
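Track Cache Invalidations
[Figure: over time, threads T1 and T2 issue a sequence of reads (r) and writes (w) to one cache line; the line's history table is updated after each access and the number of invalidations rises from 0 to 3]
Each entry: {Thread ID, Access Type}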
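The per-line history table can be sketched in C as follows. This is an illustrative sketch, not PREDATOR's source: the two-slot table, the names `line_history` and `record_access`, and the exact update rules are assumptions chosen to fit the {Thread ID, Access Type} entries and the conservative model of one private, unbounded cache per thread.

```c
#include <assert.h>
#include <stdbool.h>

enum access_type { ACCESS_READ, ACCESS_WRITE };

struct entry {
    int tid;                 /* thread that made the access */
    enum access_type type;   /* read or write */
};

/* Assumed layout: at most two remembered accesses per cache line. */
struct line_history {
    struct entry slot[2];
    int used;                /* number of valid slots: 0, 1, or 2 */
    long invalidations;      /* invalidations counted so far */
};

/* Forget the old history and remember only this access. */
static void reset_to(struct line_history *h, int tid, enum access_type type) {
    h->slot[0].tid = tid;
    h->slot[0].type = type;
    h->used = 1;
}

/* Record one access; returns true if it is counted as an invalidation. */
bool record_access(struct line_history *h, int tid, enum access_type type) {
    assert(h->used >= 0 && h->used <= 2);
    if (h->used == 0) {              /* first access to this line */
        reset_to(h, tid, type);
        return false;
    }
    if (type == ACCESS_READ) {
        /* A read never invalidates; remember it only if it adds a second thread. */
        if (h->used == 1 && h->slot[0].tid != tid) {
            h->slot[1].tid = tid;
            h->slot[1].type = ACCESS_READ;
            h->used = 2;
        }
        return false;
    }
    /* A write invalidates if any other thread may hold a copy of the line.
       A full table always contains two different threads. */
    bool invalidates = (h->used == 2) || (h->slot[0].tid != tid);
    if (invalidates)
        h->invalidations++;
    reset_to(h, tid, ACCESS_WRITE);
    return invalidates;
}
```

Under these assumed rules, a thread that keeps writing a line nobody else touches never increments the counter, while alternating accesses from two threads increment it on every write.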
PREDATOR Components
• Compiler instrumentation: instruments every memory read/write access
• Runtime system: collects memory accesses and reports false sharing
Detect Problems Correctly & Precisely
• Correctly
  – No false alarms
  – Track memory accesses on each word
[Figure: false sharing (Task 1, Task 2, Task 3, Task 4) versus true sharing (Task 1, Task 2)]
• Precisely
  – Global variables
  – Heap objects: pinpoint the line of memory allocation
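Word-level tracking is what lets a tool tell false sharing from true sharing on a line with many invalidations. The C sketch below shows one way to do that. The 4-byte word, the 64-byte line, the names `line_words`, `record_word_access`, and `looks_like_false_sharing`, and the classification rule are all assumptions for illustration rather than PREDATOR's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>

#define WORDS_PER_LINE 16    /* assumption: 64-byte line, 4-byte words */
#define NO_THREAD (-1)       /* word not accessed yet */
#define MANY_THREADS (-2)    /* word accessed by more than one thread */

struct word_info {
    int owner;               /* thread id, NO_THREAD, or MANY_THREADS */
    long reads;
    long writes;
};

struct line_words {
    struct word_info word[WORDS_PER_LINE];
};

void init_line(struct line_words *line) {
    for (int i = 0; i < WORDS_PER_LINE; i++) {
        line->word[i].owner = NO_THREAD;
        line->word[i].reads = 0;
        line->word[i].writes = 0;
    }
}

/* Record one access to word `index` of the line by thread `tid`. */
void record_word_access(struct line_words *line, int index, int tid, bool is_write) {
    assert(index >= 0 && index < WORDS_PER_LINE && tid >= 0);
    struct word_info *w = &line->word[index];
    if (w->owner == NO_THREAD)
        w->owner = tid;
    else if (w->owner != tid)
        w->owner = MANY_THREADS;   /* same word, several threads: true sharing */
    if (is_write)
        w->writes++;
    else
        w->reads++;
}

/* Assumed rule: false sharing needs two words, each used by a single but
   different thread, with at least one of the two words written. */
bool looks_like_false_sharing(const struct line_words *line) {
    for (int i = 0; i < WORDS_PER_LINE; i++) {
        for (int j = i + 1; j < WORDS_PER_LINE; j++) {
            const struct word_info *a = &line->word[i];
            const struct word_info *b = &line->word[j];
            if (a->owner >= 0 && b->owner >= 0 && a->owner != b->owner &&
                (a->writes > 0 || b->writes > 0))
                return true;
        }
    }
    return false;
}
```

A line where two threads write the same word ends up with that word marked `MANY_THREADS` and is not reported, which is how the sketch avoids flagging true sharing.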
PREDATOR’s Report
Why do we need prediction?
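Necessity of False Sharing Prediction
[Figure: the data accessed by Thread 1 and Thread 2 under three placements: across cache line 1 and cache line 2 without false sharing; across cache line 1 and cache line 2 with false sharing; within a single cache line 1 with false sharing]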
Properties Affecting False Sharing Occurrence
• Change of memory layout
  – 32-bit platform versus 64-bit platform
  – Different memory allocator
  – Different compiler or optimization
  – Different allocation order caused by changing the code, e.g., printf
• Run on hardware with a different cache line size
Example of False Sharing Sensitivity
[Figure: the same object placed in memory at starting offsets 0, 8, ..., 56; colors represent threads]
Cache line size = 64 bytes
Example of False Sharing Sensitivity
[Figure: runtime (seconds) at starting offsets 0, 8, 16, 24, 32, 40, 48, and 56]
PREDATOR predicts false sharing problems even when they do not occur in the observed run
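The offset sensitivity follows from simple address arithmetic, sketched in C below. The setup is an assumption for illustration: a contiguous array with one element per thread, and the hypothetical helpers `neighbors_share_line` and `bytes_on_shared_line`. With 64-byte elements and 64-byte lines, offset 0 gives each element its own line, while any other offset in the figure makes neighboring elements overlap on a line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when element 0 and element 1 of a contiguous per-thread array,
   whose first element starts at start_offset, have bytes on the same line. */
bool neighbors_share_line(size_t start_offset, size_t elem_size, size_t line_size) {
    assert(elem_size > 0 && line_size > 0);
    size_t last_byte_of_first = start_offset + elem_size - 1;
    size_t first_byte_of_second = start_offset + elem_size;
    return last_byte_of_first / line_size == first_byte_of_second / line_size;
}

/* Number of bytes of element 0 that sit on the line it shares with element 1. */
size_t bytes_on_shared_line(size_t start_offset, size_t elem_size, size_t line_size) {
    if (!neighbors_share_line(start_offset, elem_size, line_size))
        return 0;
    size_t end_of_first = start_offset + elem_size;              /* one past the last byte */
    size_t line_start = (end_of_first / line_size) * line_size;  /* start of the shared line */
    size_t from = start_offset > line_start ? start_offset : line_start;
    return end_of_first - from;
}
```

How much slowdown a given overlap causes depends on which bytes of each element are actually hot, which this arithmetic alone does not capture.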
Prediction Based on Virtual Cache Lines
[Figure: accesses by Thread 1 and Thread 2 under three placements. Real case: cache line 1 and cache line 2. Prediction 1: virtual cache line 1 and virtual cache line 2, false sharing. Prediction 2: a single virtual cache line 1, false sharing]
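Track Invalidations on Virtual Cache Lines
Condition: d < sz (the cache line size), X and Y come from different threads, and at least one of them is a write
[Figure: accesses X and Y lie d bytes apart; the tracked virtual line extends (sz-d)/2 bytes before X and (sz-d)/2 bytes after Y; the other candidate virtual lines are not tracked]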
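The condition and the placement of the tracked virtual line can be written out in C. This is a sketch under assumptions: the names `hot_access`, `may_falsely_share`, and `tracked_virtual_line` are hypothetical, d is taken as the distance between the two starting addresses, and the line is clamped at address 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct hot_access {
    uintptr_t addr;   /* starting address of the access */
    int tid;          /* thread that issues it */
    bool is_write;    /* true for a write, false for a read */
};

struct virtual_line {
    uintptr_t start;  /* first byte of the virtual line */
    uintptr_t end;    /* one past the last byte: start + sz */
};

/* d < sz, different threads, and at least one of the accesses is a write. */
bool may_falsely_share(struct hot_access x, struct hot_access y, uintptr_t sz) {
    assert(sz > 0);
    uintptr_t d = x.addr < y.addr ? y.addr - x.addr : x.addr - y.addr;
    return d < sz && x.tid != y.tid && (x.is_write || y.is_write);
}

/* Center the pair: leave (sz - d) / 2 bytes before the lower address. */
struct virtual_line tracked_virtual_line(uintptr_t x, uintptr_t y, uintptr_t sz) {
    uintptr_t lo = x < y ? x : y;
    uintptr_t hi = x < y ? y : x;
    uintptr_t d = hi - lo;
    assert(d < sz);
    uintptr_t pad = (sz - d) / 2;
    struct virtual_line v;
    v.start = lo >= pad ? lo - pad : 0;  /* assumption: clamp at address 0 */
    v.end = v.start + sz;
    return v;
}
```

For example, with sz = 64 and accesses at addresses 100 and 140, d is 40, the padding is 12 bytes per side, and the tracked virtual line covers addresses 88 up to, but not including, 152.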
Benchmark Results
Benchmark          Unknown problem  Without prediction  With prediction  Improvement
Histogram          ✔                ✔                   ✔                46%
Linear_regression  -                -                   ✔                1207%
Reverse_index      -                ✔                   ✔                0.09%
Word_count         -                ✔                   ✔                0.14%
Streamcluster-1    ✔                ✔                   ✔                4.77%
Streamcluster-2    -                ✔                   ✔                7.52%
Real Applications Results
• MySQL
  – Problem: false sharing occurs when different threads update the shared bitmap simultaneously
  – Performance improves 180% after fixes
• Boost library
  – Problem: “there will be 16 spinlocks per cache line”
  – Performance improves about 100%
Performance Overhead of PREDATOR
[Figure: execution time overhead, shown as normalized runtime for Original, PREDATOR-NP, and PREDATOR. Phoenix: histogram, kmeans, linear_regression, matrix_multiply, pca, reverse_index, string_match, word_count. PARSEC: blackscholes, bodytrack, dedup, ferret, fluidanimate, streamcluster, swaptions, x264. Real applications: aget, Boost, Memcached, MySQL, pbzip2, pfscan. AVERAGE. Annotations on the chart: 23, 26, 5.6X]
[Summary figure: compiler instrumentation and runtime system; cache invalidation between the caches of Core 1 and Core 2; precise report; prediction based on virtual cache lines (Real case, Prediction 1, Prediction 2)]
Detailed Prediction Algorithm
1. Find suspected cache lines
2. Track detailed memory accesses
3. Predict based on hot accesses: if d < sz (the cache line size) and X and Y come from different threads, there is potential false sharing
4. Track cache invalidations on the virtual line
[Figure: accesses X and Y lie d bytes apart; the tracked virtual line extends (sz-d)/2 bytes on each side of the pair; the other virtual lines are not tracked]