Weaving Relations for Cache Performance
Anastassia Ailamaki, Carnegie Mellon
David DeWitt, Mark Hill, and Marios Skounakis, University of Wisconsin-Madison
Memory Hierarchies
[Figure: memory hierarchy: processor execution pipeline -> L1 I-cache / L1 D-cache -> L2 cache -> main memory]

Cache misses are extremely expensive
[Figure: cycles per instruction and memory latency, VAX 11/780 vs. Pentium II Xeon: CPI has dropped from 10 to 0.336, while memory latency has grown to ~70 cycles]

1 access to memory = 1000 instruction opportunities
Processor/Memory Speed Gap
[Figure: memory stall time breakdown (%) into L1 Data, L2 Data, L1 Instruction, and L2 Instruction stalls for DBMSs A-D, three workloads: range selection (no index), range selection (clustered index), join (no index)]

PII Xeon running NT 4.0, 4 commercial DBMSs: A, B, C, D
Memory-related delays: 40%-80% of execution time
Data accesses on caches: 19%-86% of memory stalls
Breakdown of Memory Delays
Data Placement on Disk Pages
Slotted Pages: used by all commercial DBMSs
  Store table records sequentially
  Intra-record locality (attributes of record r together)
  Doesn't work well on today's memory hierarchies

Alternative: Vertical partitioning [Copeland'85]
  Store an n-attribute table as n single-attribute tables
  Inter-record locality, saves unnecessary I/O
  Destroys intra-record locality => expensive to reconstruct records

Contribution: Partition Attributes Across ... have the cake and eat it, too
  Inter-record locality + low record reconstruction cost
Outline
The memory/processor speed gap
What's wrong with slotted pages?
Partition Attributes Across (PAX)
Performance results
Summary
Table R:

RID  SSN   Name   Age
1    1237  Jane   30
2    4322  John   45
3    1563  Jim    20
4    7658  Susan  52
5    2534  Leon   43
6    8791  Dan    37

[Figure: NSM page: PAGE HEADER | RH1 1237 Jane 30 | RH2 4322 John 45 | RH3 1563 Jim 20 | RH4 7658 Susan 52 | ... | record offsets at end of page]

Records are stored sequentially
Offsets to the start of each record are kept at the end of the page
Formal name: NSM (N-ary Storage Model)
Current Scheme: Slotted Pages
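A minimal sketch of the slotted-page idea described above, assuming fixed-length records of the example table R (the constants `PAGE_SIZE` and the record format are illustrative, not Shore's actual layout): records grow from the front of the page, and the slot array of record offsets grows from the back.

```python
import struct

# Illustrative NSM (slotted) page: records appended from the front,
# slot array of record offsets growing from the back of the page.
PAGE_SIZE = 8192
REC_FMT = "I8sI"                   # SSN (uint32), Name (8 bytes), Age (uint32)
REC_SIZE = struct.calcsize(REC_FMT)

def build_nsm_page(records):
    data = bytearray(PAGE_SIZE)
    slots = []
    offset = 0
    for ssn, name, age in records:
        # Whole record stored contiguously: intra-record locality.
        struct.pack_into(REC_FMT, data, offset, ssn, name.encode(), age)
        slots.append(offset)
        offset += REC_SIZE
    # Slot array at the end of the page: slot i lives 4*(i+1) bytes from the end.
    for i, off in enumerate(slots):
        struct.pack_into("I", data, PAGE_SIZE - 4 * (i + 1), off)
    return bytes(data), len(slots)

def read_record(page, slot):
    off = struct.unpack_from("I", page, len(page) - 4 * (slot + 1))[0]
    ssn, name, age = struct.unpack_from(REC_FMT, page, off)
    return ssn, name.rstrip(b"\x00").decode(), age

page, n = build_nsm_page([(1237, "Jane", 30), (4322, "John", 45),
                          (1563, "Jim", 20), (7658, "Susan", 52)])
print(read_record(page, 3))   # -> (7658, 'Susan', 52)
```

Fetching any one field of a record still drags its whole cache block, with the neighboring fields, into the cache, which is exactly the problem the next slides illustrate.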
select name
from R
where age > 50

[Figure: NSM predicate evaluation: scanning the age field pulls whole-record cache blocks from main memory into the cache (block 1: 30 Jane RH2; block 2: 45 RH3 1563; block 3: Jim 20 RH4; block 4: 52 2534 Leon)]

NSM pushes non-referenced data to the cache
Predicate Evaluation using NSM
Need New Data Page Layout
Eliminates unnecessary memory accesses
Improves inter-record locality
Keeps a record's fields together
Does not affect I/O performance
and, most importantly, is…
low-implementation-cost, high-impact
[Figure: the same records laid out two ways.
 NSM PAGE: PAGE HEADER | RH1 1237 Jane 30 | RH2 4322 John 45 | RH3 1563 Jim 20 | RH4 7658 Susan 52
 PAX PAGE: PAGE HEADER | 1237 4322 1563 7658 | Jane John Jim Susan | 30 45 20 52]

Partition data within the page for spatial locality
Partition Attributes Across (PAX)
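A sketch of the partitioning idea, assuming the example table above (the class and attribute names are illustrative): the same records stay on the same page, but each attribute's values are grouped into their own "minipage", so values of one attribute are contiguous while record reconstruction remains a same-index lookup.

```python
# Illustrative PAX-style page: one minipage (here, a plain list) per attribute.
class PaxPage:
    def __init__(self, records):
        self.ssn  = [r[0] for r in records]
        self.name = [r[1] for r in records]
        self.age  = [r[2] for r in records]

    def reconstruct(self, i):
        # Reconstruction is cheap: the same index into every minipage.
        return (self.ssn[i], self.name[i], self.age[i])

page = PaxPage([(1237, "Jane", 30), (4322, "John", 45),
                (1563, "Jim", 20), (7658, "Susan", 52)])
print(page.age)              # ages are adjacent: [30, 45, 20, 52]
print(page.reconstruct(3))   # -> (7658, 'Susan', 52)
```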
select name
from R
where age > 50

[Figure: PAX predicate evaluation: only the age minipage (30 45 20 52) is brought into the cache; the PAX page holds PAGE HEADER | 1237 4322 1563 7658 | Jane John Jim Susan | 30 45 20 52]
Fewer cache misses, low reconstruction cost
Predicate Evaluation using PAX
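A back-of-the-envelope sketch (not a measurement from the paper) of why PAX touches fewer cache blocks on this predicate scan: count the distinct cache blocks the age values fall into under each layout, assuming an illustrative 32-byte block and the 16-byte record of the earlier sketch.

```python
# Count distinct cache blocks touched when scanning only the "age" field.
BLOCK = 32          # illustrative cache-block size in bytes
REC_SIZE = 16       # SSN (4) + name (8) + age (4), as in the NSM sketch
AGE_OFFSET = 12     # byte offset of age within an NSM record
N_RECORDS = 1000

# NSM: consecutive ages are REC_SIZE bytes apart, so the scan drags
# every record's cache block through the cache.
nsm_blocks = {(r * REC_SIZE + AGE_OFFSET) // BLOCK for r in range(N_RECORDS)}

# PAX: ages are packed contiguously in one minipage, 4 bytes apart.
pax_blocks = {(r * 4) // BLOCK for r in range(N_RECORDS)}

print(len(nsm_blocks), len(pax_blocks))   # -> 500 125
```

With these (assumed) sizes, PAX touches 4x fewer blocks for the predicate column; the gap widens as records get wider, since PAX's block count depends only on the referenced attribute's size.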
[Figure: NSM record format: HEADER (null bitmap, record length, offsets to variable-length fields), then fixed-length values, then variable-length values]

NSM: all fields of a record stored together + slots
A Real NSM Record
[Figure: PAX page layout: page header (pid, # attributes, attribute sizes, free space, # records), F-minipages for fixed-length values (each followed by presence bits), and a V-minipage for variable-length values (followed by v-offsets)]
PAX: Detailed Design
PAX: groups fields together + amortizes record headers
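A sketch of the two minipage kinds in the figure, with illustrative class names and in-memory lists standing in for the on-page byte layout: an F-minipage keeps fixed-length values plus one presence bit per record (for NULLs), and a V-minipage keeps variable-length values back to back plus v-offsets marking where each value ends.

```python
class FMinipage:
    """Fixed-length values + one presence bit per record (True = non-NULL)."""
    def __init__(self):
        self.values = []
        self.presence = []

    def append(self, v):
        self.presence.append(v is not None)
        self.values.append(0 if v is None else v)

    def get(self, i):
        return self.values[i] if self.presence[i] else None

class VMinipage:
    """Variable-length values stored back to back + v-offsets (end of each value)."""
    def __init__(self):
        self.data = bytearray()
        self.offsets = []

    def append(self, s):
        self.data += s.encode()
        self.offsets.append(len(self.data))

    def get(self, i):
        start = self.offsets[i - 1] if i > 0 else 0
        return self.data[start:self.offsets[i]].decode()

ages, names = FMinipage(), VMinipage()
for age, name in [(30, "Jane"), (None, "John"), (20, "Jim")]:
    ages.append(age)
    names.append(name)
print(ages.get(1), names.get(1))   # -> None John
```

One set of presence bits and offsets per minipage is how the per-record header overhead gets amortized across all records on the page.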
Outline
The memory/processor speed gap
What's wrong with slotted pages?
Partition Attributes Across (PAX)
Performance results
Summary
Main-memory resident R, numeric fields
Query:

select avg(ai)
from R
where aj >= Lo and aj <= Hi

PII Xeon running Windows NT 4
16KB L1-I, 16KB L1-D, 512KB L2, 512MB RAM
Used processor counters
Implemented schemes on the Shore Storage Manager
Similar behavior to commercial Database Systems
Sanity Check: Basic Evaluation
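A sketch (not the Shore implementation) of what the benchmark query computes, written over PAX-style per-attribute arrays; `a_i`, `a_j`, `lo`, and `hi` mirror the slide's symbols, and the sample values are made up.

```python
# select avg(ai) from R where aj >= Lo and aj <= Hi,
# evaluated over two per-attribute arrays (PAX-style minipages).
def avg_ai_where_aj_in_range(a_i, a_j, lo, hi):
    picked = [x for x, y in zip(a_i, a_j) if lo <= y <= hi]
    return sum(picked) / len(picked) if picked else None

a_i = [10, 20, 30, 40]
a_j = [1, 5, 9, 13]
print(avg_ai_where_aj_in_range(a_i, a_j, 5, 13))   # -> 30.0
```

Only the `a_i` and `a_j` arrays are touched, regardless of how many other attributes the table has, which is why the stall-cycle results below favor PAX.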
[Figure: execution time breakdown (% of execution time: Computation, Memory, Branch mispredictions, Resource) for DBMSs A-D and Shore]

[Figure: memory stall time breakdown (% of L1 Data, L2 Data, L1 Instruction, L2 Instruction stalls) for DBMSs A-D and Shore]
We can use Shore to evaluate DSS workload behavior
Compare Shore query behavior with commercial DBMSs:
Execution time & memory delays (range selection)
Why Use Shore?
[Figure: sensitivity to selectivity: L2 data stall cycles per record vs. selectivity (1%-100%) for NSM and PAX]

PAX saves 70% of NSM's data cache penalty
PAX reduces cache misses at both L1 and L2
Selectivity doesn't matter for PAX data stalls
Effect on Accessing Cache Data
[Figure: cache data stalls: L1 and L2 data stall cycles per record for the NSM and PAX page layouts]
PAX: 75% less memory penalty than NSM (10% of time)
Execution times converge as the number of attributes increases

[Figure: execution time breakdown: clock cycles per record (Computation, Memory, Branch mispredict, Hardware resource) for NSM vs. PAX]

[Figure: sensitivity to # of attributes: elapsed time (sec) vs. number of attributes in record (2-64) for NSM and PAX]
Time and Sensitivity Analysis
100MB, 200MB, and 500MB TPC-H DBs
Queries:
1. Range selections w/ variable parameters (RS)
2. TPC-H Q1 and Q6
   sequential scans
   lots of aggregates (sum, avg, count)
   grouping/ordering of results
3. TPC-H Q12 and Q14
   (Adaptive Hybrid) Hash Join
   complex 'where' clause, conditional aggregates
128MB buffer pool
Evaluation Using a DSS Benchmark
[Figure: PAX/NSM speedup on PII/NT (0%-60%) for queries RS, Q1, Q6, Q12, and Q14 at 100MB, 200MB, and 500MB database sizes]
PAX improves performance even with I/O
Speedup differs across DB sizes
TPC-H Queries: Speedup
Updates
Policy: update in place
Variable-length fields: shift when needed
PAX only needs to shift data within a minipage

Update statement:

update R
set ap = ap + b
where aq > Lo and aq < Hi
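A sketch (with made-up names and data, mirroring the slide's symbols `ap`, `aq`, `b`, `Lo`, `Hi`) of an in-place update against PAX-style minipages: only the `a_p` and `a_q` arrays are touched, and a fixed-length update overwrites one minipage entry without shifting anything.

```python
# update R set ap = ap + b where aq > Lo and aq < Hi,
# over two per-attribute arrays (PAX-style minipages).
def update_in_place(a_p, a_q, b, lo, hi):
    changed = 0
    for i, q in enumerate(a_q):
        if lo < q < hi:
            a_p[i] += b        # in-place update of one minipage entry
            changed += 1
    return changed

a_p = [100, 200, 300, 400]
a_q = [1, 5, 9, 13]
print(update_in_place(a_p, a_q, 7, 4, 10), a_p)   # -> 2 [100, 207, 307, 400]
```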
[Figure: PAX/NSM speedup on PII/NT (0%-18%) vs. number of updated attributes (1-7), at selectivities 2%, 10%, 20%, 50%, and 100%]
PAX always speeds queries up (7-17%)
Lower selectivity => reads dominate speedup
High selectivity => speedup dominated by write-backs
Updates: Speedup
PAX: a low-cost, high-impact data placement technique

Performance:
  Eliminates unnecessary memory references
  High utilization of cache space/bandwidth
  Faster than NSM (does not affect I/O)

Usability:
  Orthogonal to other storage decisions
  "Easy" to implement in large existing DBMSs
Summary