Improving Database Performance on Simultaneous Multithreading Processors

Jingren Zhou, Microsoft Research
John Cieslewicz, Columbia University
Kenneth A. Ross, Columbia University
Mihir Shah, Columbia University
Simultaneous Multithreading (SMT)

• Available on modern CPUs: "Hyperthreading" on the Pentium 4 and Xeon, IBM POWER5, Sun UltraSPARC IV.
• Challenge: design software to efficiently utilize SMT.
• This talk: database software, on an Intel Pentium 4 with Hyperthreading.
Superscalar Processor (no SMT)

[Figure: a single instruction stream issued through a superscalar pipeline (up to 2 instructions per cycle) over time.]
• Improved instruction-level parallelism.
• CPI = 3/4 in the illustrated example.
SMT Processor

[Figure: two instruction streams interleaved on the same superscalar pipeline over time.]
• Improved thread-level parallelism.
• More opportunities to keep the processor busy.
• But sometimes SMT does not work so well.
• CPI = 5/8 in the illustrated example.
[Figure: instruction stream 1 stalls while instruction stream 2 continues to issue (CPI = 3/4); progress is made despite the stalled thread.]
• Stalls are due to cache misses (200-300 cycles for an L2 miss), branch mispredictions (20-30 cycles), etc.
Memory Consistency

[Figure: instruction streams 1 and 2 access the same cache line; the conflicting access is detected, the pipeline is flushed, and the cache is synchronized with RAM.]
• "MOMC Event" on the Pentium 4: 300-350 cycles.
SMT Processor

• Exposes multiple "logical" CPUs, one per instruction stream.
• One physical CPU (~5% extra silicon to duplicate thread state information).
• Better than single threading: increased thread-level parallelism, and improved processor utilization when one thread blocks.
• Not as good as two physical CPUs: CPU resources are shared, not replicated.
SMT Challenges

• Resource competition: shared execution units, shared cache.
• Thread coordination: locking, etc., has high overhead.
• False sharing: MOMC events (see the sketch below).
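As a hedged illustration of the false-sharing challenge (not code from the paper), the sketch below shows two threads updating counters that happen to share a cache line, the pattern that manifests as MOMC events on the Pentium 4; padding each counter onto its own line avoids the conflict. The struct names and the 64-byte alignment are assumptions for illustration.

```cpp
// Hypothetical illustration of false sharing between two SMT threads.
#include <cstdint>
#include <thread>

// Bad: both counters live on the same cache line, so concurrent updates
// from two threads cause conflicting accesses to a common line.
struct SharedCounters {
    std::uint64_t a;
    std::uint64_t b;
};

// Better: align each counter to its own cache line (assumed 64 B here;
// the L2 line on the paper's Pentium 4 is 128 B).
struct PaddedCounters {
    alignas(64) std::uint64_t a;
    alignas(64) std::uint64_t b;
};

int main() {
    PaddedCounters c{};
    std::thread t0([&] { for (int i = 0; i < 1'000'000; ++i) ++c.a; });
    std::thread t1([&] { for (int i = 0; i < 1'000'000; ++i) ++c.b; });
    t0.join();
    t1.join();
    return 0;
}
```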
Approaches to using SMT

• Ignore it, and write single-threaded code.
• Naïve parallelism: pretend the logical CPUs are physical CPUs.
• SMT-aware parallelism: parallel threads designed to avoid SMT-related interference.
• Resource management: use one thread for the algorithm, and another to manage resources, e.g., to avoid stalls on cache misses.
Naïve Parallelism

• Treat the SMT processor as if it were multi-core.
• Databases are already designed to utilize multiple processors, so no code modification is needed.
• Uses shared processor resources inefficiently: cache pollution/interference, competition for execution units.
SMT-Aware Parallelism

• Exploit intra-operator parallelism: divide the input and use a separate thread to process each part.
• E.g., one thread for even tuples, one for odd tuples; no explicit partitioning step is required.
• Sharing input involves only multiple readers: no MOMC events, because two reads don't conflict.
SMT-Aware Parallelism (cont.)

• Sharing output is challenging: it requires thread coordination for output, and read/write and write/write conflicts on common cache lines cause MOMC events.
• "Solution": partition the output. Each thread writes to a separate memory buffer to avoid memory conflicts (see the sketch below).
• Needs an extra merge step in the consumer of the output stream.
• Difficult to maintain input order in the output.
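A minimal sketch of this style of SMT-aware intra-operator parallelism, assuming a hypothetical Tuple type and per-tuple routine process_tuple; it illustrates the even/odd input split with partitioned output buffers, not the paper's implementation.

```cpp
// Minimal sketch of SMT-aware intra-operator parallelism (illustrative only).
#include <cstddef>
#include <thread>
#include <vector>

struct Tuple { int key; int payload; };

// Hypothetical per-tuple work (e.g., a hash-table probe).
Tuple process_tuple(const Tuple& t) { return t; }

// Each thread reads every second tuple (even vs. odd), so the shared input
// is read-only and no explicit partitioning step is needed.  Each thread
// writes to its OWN output buffer, avoiding write/write conflicts on
// common cache lines (MOMC events).
void worker(const std::vector<Tuple>& input, std::size_t offset,
            std::vector<Tuple>& my_output) {
    for (std::size_t i = offset; i < input.size(); i += 2) {
        my_output.push_back(process_tuple(input[i]));
    }
}

void smt_aware_operator(const std::vector<Tuple>& input) {
    std::vector<Tuple> out_even, out_odd;   // separate output buffers
    std::thread t0(worker, std::cref(input), 0, std::ref(out_even));
    std::thread t1(worker, std::cref(input), 1, std::ref(out_odd));
    t0.join();
    t1.join();
    // The consumer must merge out_even and out_odd; the original input
    // order is not preserved across the two buffers.
}
```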
Managing Resources for SMT

• Cache misses are a well-known performance bottleneck for modern database systems: mainly L2 data cache misses, but also L1 instruction cache misses [Ailamaki et al. 98].
• Goal: use a "helper" thread to avoid cache misses in the "main" thread by loading future memory references into the cache, with an explicit load, not a prefetch.
Data Dependency

• Memory references that depend upon a previous memory access exhibit a data dependency.
• E.g., a hash table lookup: hash bucket, then overflow cells, then the tuple, where each access depends on the pointer loaded by the previous one.
[Figure: hash buckets, overflow cells, and the tuple, linked by pointers.]
Data Dependency (cont.)

• Data dependencies make instruction-level parallelism harder.
• Modern architectures provide prefetch instructions: a request that data be brought into the cache; non-blocking.
• Pitfalls: prefetch instructions are frequently dropped, they are difficult to tune, and too much prefetching can pollute the cache. (A hedged example follows below.)
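A hedged example of issuing a prefetch during hash probes. The hash-table layout and the batch-probe loop are hypothetical, and __builtin_prefetch is the GCC/Clang intrinsic, used here only to show that a prefetch is a non-binding hint issued ahead of the dependent access.

```cpp
// Illustrative use of a software prefetch instruction (GCC/Clang
// __builtin_prefetch); not the paper's actual code.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cell { std::uint32_t key; Cell* next; std::uint64_t payload; };

struct HashTable {
    std::vector<Cell*> buckets;
    std::size_t bucket_of(std::uint32_t key) const {
        return key % buckets.size();
    }
};

// Probe a batch of keys.  While chasing the chain for probe i, prefetch the
// bucket-array slot for probe i+1, whose address does not depend on the
// current chain (the data dependency is within a chain, not across probes).
std::uint64_t probe_batch(const HashTable& ht,
                          const std::vector<std::uint32_t>& keys) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i + 1 < keys.size()) {
            // Hint only: the hardware may drop it, and prefetching too
            // aggressively can evict useful data (cache pollution).
            __builtin_prefetch(&ht.buckets[ht.bucket_of(keys[i + 1])]);
        }
        for (Cell* c = ht.buckets[ht.bucket_of(keys[i])]; c; c = c->next) {
            if (c->key == keys[i]) { sum += c->payload; break; }
        }
    }
    return sum;
}
```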
Staging Computation

[Figure: a probe chain of hash bucket A, overflow cells B and C, and the tuple; each element is assumed to occupy its own cache line.]

1. Preload A.
2. (other work)
3. Process A.
4. Preload B.
5. (other work)
6. Process B.
7. Preload C.
8. (other work)
9. Process C.
10. Preload the tuple.
11. (other work)
12. Process the tuple.
Staging Computation (cont.)

• By overlapping memory latency with other work, some cache miss latency can be hidden.
• Many probes are "in flight" at the same time.
• Algorithms need to be rewritten; e.g., Chen et al. [2004], Harizopoulos et al. [2004]. (A sketch of the idea follows below.)
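A minimal single-threaded sketch of staging computation in the spirit of the group-prefetching work cited above: several probes are kept in flight, and each probe's next node is preloaded (here via a GCC/Clang prefetch hint) before control returns to process it. The data structures, group size, and loop structure are assumptions, not the paper's code.

```cpp
// Hypothetical sketch: keep several hash probes "in flight", preloading the
// next node of each probe before coming back to process it.
#include <cstddef>
#include <cstdint>

struct Cell { std::uint32_t key; Cell* next; std::uint64_t payload; };

constexpr std::size_t GROUP = 8;   // probes kept in flight at once

// Assumes `keys` holds at least GROUP keys.
std::uint64_t probe_group(Cell* const buckets[], std::size_t nbuckets,
                          const std::uint32_t keys[]) {
    Cell* cur[GROUP];
    std::uint64_t sum = 0;

    // Stage 0: preload each probe's hash bucket.
    for (std::size_t i = 0; i < GROUP; ++i) {
        cur[i] = buckets[keys[i] % nbuckets];
        __builtin_prefetch(cur[i]);            // preload now, process later
    }

    // Later stages: by the time we return to probe i, its current node is
    // (hopefully) cache resident; process it and preload the next node.
    bool active = true;
    while (active) {
        active = false;
        for (std::size_t i = 0; i < GROUP; ++i) {
            if (cur[i] == nullptr) continue;
            if (cur[i]->key == keys[i]) {      // match: consume the tuple
                sum += cur[i]->payload;
                cur[i] = nullptr;
            } else {                           // advance down the chain
                cur[i] = cur[i]->next;
                if (cur[i]) __builtin_prefetch(cur[i]);
            }
            active = active || (cur[i] != nullptr);
        }
    }
    return sum;
}
```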
Work-Ahead Set: Main Thread

• Writes a memory address + computation state to the work-ahead set.
• Retrieves a previously written address + state.
• Hope: the helper thread can preload the data before it is retrieved by the main thread.
• Correct whether or not the helper thread succeeds at preloading data; the helper thread is read-only.
(A sketch of the main-thread side follows below.)
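A minimal sketch of the main-thread side, assuming the work-ahead set is a fixed-size ring of (state, address) slots; the entry layout, the resume callback, and the memory-ordering choices are assumptions for illustration rather than the paper's implementation.

```cpp
// Hypothetical work-ahead set: a fixed-size array of (state, address)
// entries shared between the main thread and the helper thread.
#include <atomic>
#include <cstddef>

constexpr std::size_t WAS_SIZE = 128;   // the paper settles on 128 entries

struct Entry {
    std::atomic<int>   state{0};        // which step of the probe comes next
    std::atomic<void*> address{nullptr};// memory the next step dereferences
};

struct WorkAheadSet {
    Entry slot[WAS_SIZE];
};

struct Pending { int state; void* address; };

// Main thread: swap in the new (state, address) pair and take out the pair
// posted WAS_SIZE iterations ago.  By then the helper thread has (hopefully)
// touched that address, so the dereference in resume() hits the cache.
// Correctness does not depend on the helper thread succeeding.
void main_thread_step(WorkAheadSet& was, std::size_t& pos, Pending next,
                      void (*resume)(int state, void* address)) {
    Entry& e = was.slot[pos];
    Pending old{ e.state.load(std::memory_order_relaxed),
                 e.address.load(std::memory_order_relaxed) };
    e.address.store(next.address, std::memory_order_relaxed);
    e.state.store(next.state, std::memory_order_relaxed);
    if (old.address != nullptr) {
        resume(old.state, old.address);   // continue the suspended probe
    }
    pos = (pos + 1) % WAS_SIZE;
}
```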
Work-Ahead Set Data Structure

[Figure: two snapshots of the work-ahead set, an array of (state, address) slots. The main thread first fills the slots with entries A1-F1; on later passes it overwrites them with G1 and H2-L2, retrieving the older entries as it goes.]
Work-Ahead Set: Helper Thread

• Reads memory addresses from the work-ahead set and loads their contents.
• The data becomes cache resident.
• Tries to preload data before the main thread cycles around; if successful, the main thread experiences cache hits.
[Figure: the helper thread walks the work-ahead set, touching each posted address with an explicit load ("temp += *slot[i]") so the line becomes cache resident. It iterates backwards over the slots, i = (i - 1) mod size. Why backwards? See the paper.]
(A sketch of the helper-thread side follows below.)
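A matching sketch of the helper-thread loop, using the same assumed slot layout as the main-thread sketch above; the volatile read stands in for the slide's explicit load "temp += *slot[i]", and the backwards iteration mirrors the slide (the reasoning is in the paper).

```cpp
// Hypothetical helper thread: read-only with respect to the operator's
// data; it only touches addresses posted in the work-ahead set.
#include <atomic>
#include <cstddef>

constexpr std::size_t WAS_SIZE = 128;

struct Entry {                       // same assumed layout as above
    std::atomic<int>   state{0};
    std::atomic<void*> address{nullptr};
};

struct WorkAheadSet {
    Entry slot[WAS_SIZE];
};

void helper_thread(WorkAheadSet& was, const std::atomic<bool>& done) {
    std::size_t i = 0;
    volatile long temp = 0;          // sink for the explicit loads
    while (!done.load(std::memory_order_relaxed)) {
        void* p = was.slot[i].address.load(std::memory_order_relaxed);
        if (p != nullptr) {
            // Explicit load, not a prefetch: it cannot be dropped by the
            // hardware, and it brings the line into the cache.
            temp += *static_cast<volatile long*>(p);
        }
        // Iterate backwards over the slots: i = (i - 1) mod size.
        i = (i + WAS_SIZE - 1) % WAS_SIZE;
    }
    (void)temp;
}
```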
Helper Thread Speed

• If the helper thread is faster than the main thread (more computation than memory latency): the helper thread should not preload twice (wasted CPU cycles); see the paper for how to stop redundant loads.
• If the helper thread is slower: no special tuning is necessary; the main thread will absorb some cache misses.
Work-Ahead Set Size

• Too large: cache pollution. Preloaded data evicts other preloaded data before it can be used.
• Too small: thread contention. Many MOMC events, because the work-ahead set spans few cache lines.
• Just right: determined experimentally. Use the smallest size within the acceptable range (performance plateaus), so that cache space remains available for other purposes (here, 128 entries).
• The data structure itself is much smaller than the L2 cache.
Experimental Workload

• Two operators: the probe phase of a hash join, and a CSB+-tree index join.
• Operators run in isolation and in parallel.
• Intel VTune is used to measure hardware events.
CPU:                     Pentium 4, 3.4 GHz
Memory:                  2 GB DDR
L1, L2 size:             8 KB, 512 KB
L1, L2 cache-line size:  64 B, 128 B
L1 miss latency:         18 cycles
L2 miss latency:         276 cycles
MOMC latency:            ~300+ cycles
Experimental Outline

1. Hash join
2. Index lookup
3. Mixed: hash join and index lookup
[Figure: hash join, comparative performance.]
[Figure: hash join, L2 cache misses per tuple.]
[Figure: CSB+-tree index join, comparative performance.]
[Figure: CSB+-tree index join, L2 cache misses per tuple.]
Parallel Operator Performance

[Figure: parallel operator performance (chart annotations: 52%, 55%, 20%).]
[Figure: parallel operator performance (chart annotations: 26%, 29%).]
Conclusion

                    Naïve parallel   SMT-aware      Work-ahead
Impl. effort        Small            Moderate       Moderate
Data format         Unchanged        Split output   Unchanged
Data order          Unchanged        Changed        Unchanged*
Performance (row)   Moderate         High           High
Performance (col)   Moderate         High           Moderate
Control of cache    No               No             Yes