SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads
Islam Atta
Andreas Moshovos
Pınar Tözün
Anastasia Ailamaki
Online Transaction Processing (OLTP)
$100 billion/yr market, growing 10% annually
• E.g., banking, online purchases, stock market…
Benchmarking
• Transaction Processing Performance Council (TPC)
• TPC-C: wholesale retailer
• TPC-E: brokerage market
OLTP drives innovation for HW and DB vendors
© Islam Atta
Transactions Suffer from Instruction Misses
Many concurrent transactions
[Figure: over time, each transaction's instruction footprint exceeds the L1-I size]
Instruction stalls due to L1 instruction cache thrashing
Even on a CMP, All Transactions Suffer
[Figure: transactions running on the cores of a CMP, each thrashing its own L1-I cache over time]
All caches are thrashed with similar code blocks
Opportunity
Spreading the footprint over multiple cores reduces instruction misses
Technology:
• A CMP's aggregate L1 instruction cache capacity is large enough
Application behavior:
• Instructions overlap within and across transactions
[Figure: multiple threads spread the footprint across multiple L1-I caches over time]
SLICC Overview
Dynamic hardware solution:
• How to divide a transaction
• When to move
• Where to go
Performance:
• Reduces instruction misses by 44% (TPC-C), 68% (TPC-E)
• Improves performance by 60% (TPC-C), 79% (TPC-E)
Robust:
• Non-OLTP workloads remain unaffected
Talk Roadmap
• Intra/inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
OLTP Facts
Many concurrent transactions
Few DB operations
• 28-65KB each
Few transaction types
• TPC-C: 5, TPC-E: 12
Transactions fit in 128-512KB
Instructions overlap within and across different transactions
[Figure: New Order and Payment transactions composed of shared operations R(), U(), I(), D(), IT(), ITP()]
A CMP's aggregate L1-I cache is large enough
Instruction Commonality Across Transactions
[Figure: code-reuse maps for TPC-C and TPC-E, shown across all threads and per transaction type; more yellow means reuse by most threads rather than few or a single one]
Lots of code reuse
Even higher across same-type transactions
Requirements
Enable usage of aggregate L1-I capacity
• Large cache capacity without increased latency
Exploit instruction commonality
• Localize common transaction instructions
Dynamic
• Independent of footprint size or cache configuration
Talk Roadmap
• Intra/inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
Example for Concurrent Transactions
[Figure: control-flow graphs of transactions T1, T2, T3, divided into code segments that can each fit into the L1-I]
Scheduling Threads
[Figure: threads T1, T2, T3 time-multiplexed over cores 0-3. Conventional: each thread stays on one core and refills that core's L1-I for every code segment. SLICC: threads migrate to the core whose L1-I already holds the next segment.]
Conventional: cache filled 10 times. SLICC: cache filled 4 times.
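The scheduling idea above can be illustrated with a toy model: each thread is a sequence of code segments, and each segment is assumed to fill a whole L1-I. This is our own idealized sketch, not the hardware mechanism; the thread/segment mix and the assumption that each core caches exactly one segment are illustrative.

```python
def conventional_fills(threads, num_cores):
    """Each thread is pinned to its own core (assumes one thread per core);
    the core's L1-I is refilled whenever the thread enters a segment that
    is not the one currently cached there."""
    fills = 0
    cached = [None] * num_cores
    for core, segs in enumerate(threads):
        for s in segs:
            if cached[core] != s:
                cached[core] = s
                fills += 1
    return fills


def slicc_fills(threads, num_cores):
    """Idealized SLICC-style policy: before running a segment, a thread
    migrates to a core that already caches it; otherwise the segment is
    filled into the next core's L1-I (contention and eviction ignored)."""
    fills = 0
    cached = [None] * num_cores
    nxt = 0
    for segs in threads:
        for s in segs:
            if s in cached:
                continue        # migrate to the core already holding s
            cached[nxt] = s     # fill one core's L1-I with this segment
            nxt = (nxt + 1) % num_cores
            fills += 1
    return fills


# Three threads sharing segments A, B, C, in the spirit of the slide:
threads = ["ABAC", "BAC", "CAB"]
print(conventional_fills(threads, 4))
print(slicc_fills(threads, 4))
```

With this mix, pinned scheduling refills a cache ten times, while the idealized migration policy fills each distinct segment only once; the real SLICC example on the slide pays a few more fills because migration decisions are made with imperfect information.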
Talk Roadmap
• Intra/inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
Migration Ingredients
When to migrate?
• Step 1: detect cache full
• Step 2: detect new code segment
Where to go?
• Step 3: predict where the next code segment is
Migration Ingredients (example: T1)
[Figure: T1's timeline applying the three steps; loops keep the thread on one core, migrations target idle cores, and the thread later returns to a core that still caches an earlier segment]
Migration Ingredients (example: T2)
[Figure: T2's timeline applying the same three steps, reusing segments left behind in remote caches]
Implementation
When to migrate?
• Step 1, detect cache full: Miss Counter
• Step 2, detect new segment: Miss Dilution
Where to go?
• Step 3: find signature blocks on remote cores
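The two detection steps can be sketched in software. This is a hedged behavioral model, not the hardware: the threshold values and the exact trigger condition (a miss arriving after a mostly-hitting window, taken as the start of a new segment) are our illustrative choices.

```python
class MigrationDetector:
    """Toy model of SLICC's two detection steps. Thresholds are
    illustrative placeholders, not the paper's values."""

    def __init__(self, fill_up_t=8, window=16, dilution_t=4):
        self.misses = 0          # Miss Counter (MC): cache-full detection
        self.history = []        # recent outcomes, 1 = miss, 0 = hit
        self.fill_up_t = fill_up_t
        self.window = window
        self.dilution_t = dilution_t

    def access(self, is_miss):
        """Record one L1-I access. Returns True when a migration should
        trigger: the cache looks full (MC >= Fill-up_t) and a fresh miss
        arrives after a diluted (mostly-hitting) window."""
        diluted = (len(self.history) == self.window
                   and sum(self.history) <= self.dilution_t)
        trigger = self.misses >= self.fill_up_t and is_miss and diluted
        self.misses += 1 if is_miss else 0
        self.history.append(1 if is_miss else 0)
        if len(self.history) > self.window:
            self.history.pop(0)     # keep a sliding window of outcomes
        return trigger


# A burst of misses fills the cache, a run of hits dilutes the misses,
# and the next miss (a new segment starting) triggers migration:
d = MigrationDetector()
stream = [True] * 8 + [False] * 16 + [True]
decisions = [d.access(m) for m in stream]
print(decisions[-1])
```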
Boosting Effectiveness
More overlap across transactions of the same type
SLICC: transaction-type-oblivious
Transaction-type-aware variants:
• SLICC-Pp: pre-processing to detect similar transactions
• SLICC-SW: software provides type information
Talk Roadmap
• Intra/inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
Experimental Evaluation
How does SLICC affect INSTRUCTION misses?
• Our primary goal
How does it affect DATA misses?
• Expected to increase, but by how much?
Performance impact:
• Are DATA misses and MIGRATION OVERHEADS amortized?
Methodology
Simulation
• Zesto (x86)
• 16 OoO cores, 32KB L1-I, 32KB L1-D, 1MB per-core L2
• QEMU extension
• User and kernel space
Workloads
• Shore-MT
Effect on Misses
Baseline: no effort to reduce instruction misses
[Figure: I-MPKI and D-MPKI (0-45) for Base, SLICC, and SLICC-SW on TPC-C-10, TPC-E, and MapReduce; lower is better]
SLICC reduces I-MPKI by 58% and increases D-MPKI by 7%.
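For readers unfamiliar with the chart's metric, MPKI is misses per kilo-instruction; the numbers in this small helper are made up for illustration, not taken from the evaluation.

```python
def mpki(misses, instructions):
    """Misses per kilo-instruction: miss count normalized to
    thousands of retired instructions."""
    return misses / (instructions / 1000.0)


# Illustrative numbers only: 58,000 misses over 1M instructions.
print(mpki(58_000, 1_000_000))
```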
Performance
Next-line: always prefetch the next line
PIF: upper bound for Proactive Instruction Fetch [Ferdman, MICRO'11]
[Figure: speedup (1.0-2.0; higher is better) of Next-Line, PIF (no overhead), SLICC, and SLICC-SW on TPC-C-1, TPC-C-10, TPC-E, and MapReduce]
TPC-C: +60%, TPC-E: +79%
Storage per core: PIF ~40KB, SLICC <1KB
Summary
OLTP performance suffers due to instruction stalls.
Technology and application opportunities:
• The instruction footprint fits in the aggregate L1-I capacity of CMPs.
• Inter- and intra-thread locality.
SLICC:
• Thread migration spreads the instruction footprint over multiple cores.
• Reduces I-MPKI by 58%.
• Improves performance over: Baseline +70%, Next-line +44%, PIF ±2% to +21%.
Why Do Data Misses Increase?
Example: a thread migrates from core A to core B.
• It re-reads data on core B that was already fetched on core A.
• Its writes on core B invalidate data cached on core A.
• When it returns to core A, its cache blocks may have been evicted by other threads.
SLICC Agent per Core
[Figure: per-core agent hardware. Cache-full detection: a Miss Counter (MC) compared against Fill-up_t. Miss-dilution tracking: a Miss Shift-Vector (MSV) shifts in miss(1)/hit(0) outcomes; counting its "1"s against Dilution_t enables migration. Locating missed blocks on remote cores: a Miss Tag-Queue (MTQ) feeds a remote cache-segment search, and a core with at least Matched_t matching entries is selected as the migration target.]
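The target-selection step (Step 3) can be sketched as follows. This is our simplified model: recent miss tags (an MTQ analogue) are checked against each remote core's set of cached block tags, and a core is chosen if enough tags match (Matched_t). Real SLICC queries compact cache signatures rather than exact sets, and the threshold here is illustrative.

```python
def select_core(miss_tags, remote_cached, matched_t=3):
    """Pick the remote core whose cached blocks cover the most recent
    miss tags; return None unless coverage reaches the Matched_t
    threshold (i.e., no migration target found)."""
    best_core, best_hits = None, 0
    for core, cached in remote_cached.items():
        hits = sum(1 for tag in miss_tags if tag in cached)
        if hits > best_hits:
            best_core, best_hits = core, hits
    return best_core if best_hits >= matched_t else None


# Hypothetical example: core 1 holds most of the recently missed blocks.
remote = {1: {"a", "b", "c", "d"}, 2: {"x", "y"}}
print(select_core(["a", "b", "c", "q"], remote))
```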
Detailed Methodology
• Zesto (x86)
• Qtrace (QEMU extension)
• Shore-MT
Hardware Cost
Larger I-Caches?
[Figure: instruction and data MPKI (0-60, broken into conflict, capacity, and compulsory misses) and speedup (0-1.4) versus cache size (16-512KB) for TPC-C-10, TPC-E, and MapReduce; lower MPKI and higher speedup are better]
Different Replacement Policies?
[Figure: L1 instruction MPKI (0-40; lower is better) under LRU, LIP, BIP, DIP, SRRIP, BRRIP, and DRRIP for TPC-C, TPC-E, and MapReduce]
Parameter Space (1)
[Figure: I-MPKI, D-MPKI (0-70), and speedup (0-1.6) sweeping Fill-up_t (128-512, top) and Matched_t (2-10, bottom) for TPC-C and TPC-E; lower MPKI and higher speedup are better]
Parameter Space (2)
[Figure: I-MPKI, D-MPKI (0-60), and speedup (0-2.0) sweeping Dilution_t (2-30) for TPC-C and TPC-E; lower MPKI and higher speedup are better]
Cache Signature Accuracy
Partial Bloom filter
[Figure: Bloom-filter accuracy (96-100%; higher is better) for signature sizes from 512 to 8K bits, on TPC-C and TPC-E]
© Islam Atta 35