Date post: 07-Jan-2017
Category: Data & Analytics
Upload: mongodb
Keith Bostic, MongoDB WiredTiger team
#MDBE16
Storage engines are performance critical
Middleware
Networking
Query APIs
mmapV1 Storage Engine
RocksDB Storage Engine
WiredTiger Storage Engine
ACID transactional guarantees
WiredTiger
• From (some of) the folks that brought you Berkeley DB
• High performance data engine
  • scalable throughput with low latency
• MongoDB’s default storage engine
  • a general-purpose workhorse
Each core has multiple memory caches
[Diagram: cores 1 through N, each with two or more caches]
Cache coherence: cores “snoop” on writes
[Diagram: cores 1 through N, each with two or more caches, all connected to main memory]
Traditional data engines struggle with this architecture
• Writing “shared” memory is slow
  • but databases exist to manage shared access to data!
• Snoopy cache-coherence scales poorly
Programmers solve this with locking
• Locks are complex objects
  • get exclusive access to the lock state
  • review and update the lock state
  • “publish” (ensure every CPU sees the changes)
  • release exclusive access
Locking is slow
• Every operation requires exclusive access
  • even shared (“read”) locks require a lock/unlock cycle
  • thread stall is inevitable
• Locks require notification of every CPU
• Locks require exclusive access to the memory bus
Locking is expensive
• A lock per object is too much memory
  • POSIX locks are cache-aligned, up to 128B
  • grouping objects under locks makes contention worse
• More complexity to make locks “fair” and avoid starvation
  • add thread queues
  • wake up the next thread waiting for the lock
We need to find something else
If we can’t use locks, what do we use instead?
Today we’re going to talk about ways to get rid of locks.
WiredTiger is written in C
• Java or C++ are better choices for system programming
  • automatic memory management vs. malloc/free
  • exception handling vs. explicit error paths
  • widespread availability of reusable components
• Giving up programmer productivity
C is “portable assembler”
• Marshal typed values to/from unaligned memory
  • streaming compression, encryption, checksums
  • unstructured I/O to/from stable storage
• Light-weight access to shared data
  • use the underlying machine primitives that make up locks
  • algorithms/structures based on those primitives
Pages in the WiredTiger cache
[Diagram: cache pages 2, 6, 8 and 9]
• Lots and lots (and lots) of pages
• MongoDB worker threads read from disk
• WiredTiger server threads evict to disk
A reasonable page-locking implementation
• MongoDB worker threads read, modify pages
• WiredTiger server threads evict pages from the cache
• Allocate a lock per page
  • MongoDB worker threads share pages
  • WiredTiger eviction threads require exclusive access
Page locking in the WiredTiger cache
[Diagram: pages 2, 6, 8 and 9, each with its own lock; reader, writer and eviction threads contend for the locks]
• thread stall on read locks!
• vulnerable to starvation
• too much memory
Introducing memory barriers
• Memory barriers
  • order reads, writes or both across a line of code
  • compiler won’t cache values or reorder across a barrier
• Locks imply memory barriers
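As a rough illustration of a barrier ordering writes, here is a minimal sketch using C11 `<stdatomic.h>` fences. The names `payload`, `ready`, `publish` and `try_consume` are made up for this example and are not taken from WiredTiger.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* ordinary data, written before the flag */
static atomic_bool ready;  /* flag that announces the payload */

/* Writer: store the data, issue a barrier, then set the flag. */
static void publish(int value)
{
    payload = value;
    /* Writes above this line may not be reordered below it. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Reader: returns true and fills *out only once the flag is visible. */
static bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    /* Reads below this line may not be reordered above it. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

Without the two fences, the compiler or CPU would be free to make the flag visible before the payload.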
Something faster
• Hazard pointers: a technique for avoiding locks
• MongoDB worker threads
  • “log” intention to access a page
  • publish: a memory barrier to ensure global CPU visibility
• Write to a per-thread memory location
  • write won’t collide with other worker threads
What about eviction starvation?
• Add a per-page “blocker”
  • MongoDB worker won’t proceed if the page is blocked
• Cheap:
  • it’s only a bit of information
  • a read-only operation for workers
Worker threads
• Publish intent to access the page
  • Memory barrier to ensure global CPU visibility
• If the page is not blocked, it’s accessible
• Clear intent to access when done
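The worker-side steps can be sketched in C11 as follows. This is an illustration under assumptions, not WiredTiger's code: `struct page`, `struct worker`, `page_acquire` and `page_release` are hypothetical names, and each worker is assumed to hold a single hazard slot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct page {
    atomic_bool blocked;  /* per-page "blocker", set by eviction */
    int data;
};

/* One slot per worker thread: only its owner ever writes it. */
struct worker {
    _Atomic(struct page *) hazard;
};

/* Returns true if the worker may use the page; pair with page_release(). */
static bool page_acquire(struct worker *w, struct page *p)
{
    /* Publish intent. The sequentially consistent store also acts as
     * the barrier that makes the intent visible to every CPU. */
    atomic_store(&w->hazard, p);

    /* Read-only check of the blocker. */
    if (atomic_load(&p->blocked)) {
        atomic_store(&w->hazard, NULL);  /* back off */
        return false;
    }
    return true;
}

/* Clear intent to access when done. */
static void page_release(struct worker *w)
{
    atomic_store_explicit(&w->hazard, NULL, memory_order_release);
}
```

The worker never writes shared state: it stores only to its own slot and reads the page flag.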
Hazard pointers for workers
[Diagram: pages 2, 6, 8 and 9, each with a blocking flag; the reader and writer threads each keep a list of the pages they are accessing]
Eviction server
• Block future worker thread access
  • Memory barrier to ensure global CPU visibility
• Review worker thread access intentions
  • can either wait or quit
• Unblock worker thread access when done
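The eviction side might look like the sketch below. The fixed-size `hazard` table, `MAX_WORKERS` and `try_evict` are assumptions made for the example, and the blocking flag is assumed to live in a descriptor that outlives the page contents. This version quits rather than waits when a worker is using the page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKERS 4  /* assumed fixed number of worker slots */

struct page {
    atomic_bool blocked;
};

/* Per-worker slots where workers publish the page they are using. */
static _Atomic(struct page *) hazard[MAX_WORKERS];

/* Returns true if no worker had published intent and eviction ran. */
static bool try_evict(struct page *p)
{
    /* Block future access; the sequentially consistent store doubles
     * as the barrier for global visibility. */
    atomic_store(&p->blocked, true);

    /* Review every worker's published intent. */
    for (size_t i = 0; i < MAX_WORKERS; i++) {
        if (atomic_load(&hazard[i]) == p) {
            atomic_store(&p->blocked, false);  /* in use: quit */
            return false;
        }
    }

    /* ... the page would be written out and discarded here ... */

    atomic_store(&p->blocked, false);  /* unblock when done */
    return true;
}
```

Because the worker publishes and then checks the flag, while eviction sets the flag and then checks the slots, at least one side always notices the other.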
Hazard pointers for workers and eviction
[Diagram: pages 2, 6, 8 and 9, each with a blocking flag; the reader and writer threads keep lists of the pages they are accessing, which the eviction thread reviews]
Something faster: hazard pointers
Replaces two lock/unlock pairs for each page access with a single memory barrier instruction.
• Transfers work to the eviction server
  • MongoDB worker latency is what we care about
• Memory costs
  • per-worker-thread list
  • per-page blocking flag
Introducing atomic instructions
• Atomic increment or decrement
  • read a value
  • change it and store it back without the possibility of racing
• Based on compare-and-swap (CAS) instruction
  • read value
  • update value if the value is unchanged
  • but fail if the value has changed
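To make the read, update, retry cycle concrete, here is a small sketch of an increment built on C11 compare-and-swap. `cas_increment` is a name invented for this example. In practice `atomic_fetch_add` does the same job directly.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Increment *counter with a CAS retry loop; returns the new value. */
static uint64_t cas_increment(_Atomic uint64_t *counter)
{
    uint64_t seen = atomic_load(counter);  /* read value */

    /* Store seen + 1 only if the counter still holds 'seen'. On
     * failure, 'seen' is refreshed with the current value and the
     * loop tries again. */
    while (!atomic_compare_exchange_weak(counter, &seen, seen + 1))
        ;
    return seen + 1;
}
```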
Atomic prepend to singly-linked list
Update head if (and only if) head’s value is unchanged
[Diagram: NEW node linked in front of the current head]
new.next = head
compare_and_swap(head, new.next, new)
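The two lines of pseudocode above can be written with C11 atomics roughly as follows. `struct node` and `list_prepend` are illustrative names only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Lock-free prepend: retry until head is swung to new_node. */
static void list_prepend(_Atomic(struct node *) *head, struct node *new_node)
{
    struct node *old = atomic_load(head);

    do {
        new_node->next = old;  /* new.next = head */
        /* Succeeds only if head still equals 'old'; otherwise 'old'
         * is reloaded with the current head and we try again. */
    } while (!atomic_compare_exchange_weak(head, &old, new_node));
}
```

A concurrent reader sees either the old head or the fully linked new node, never a half-built list.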
How WiredTiger uses skiplists
• WiredTiger pages start with a disk image
  • a compact representation we don’t want to modify
• Inserts and updates for the disk image are stored in skiplists
Skiplists start with a linked list
Singly-linked list with sorted values: 7, 10, 13, 18, 21, 24
7 -> 10 -> 13 -> 18 -> 21 -> 24
Skiplists: add additional linked lists
Each higher level “skips” over more of the list
Level 3: 7 -> 21
Level 2: 7 -> 13 -> 21 -> 24
Level 1: 7 -> 10 -> 13 -> 18 -> 21 -> 24
Search for 18
search starts at the top-level
Level 3: 7 -> 21
Level 2: 7 -> 13 -> 21 -> 24
Level 1: 7 -> 10 -> 13 -> 18 -> 21 -> 24
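A single-threaded sketch of the search, assuming a fixed three-level list like the one pictured. `struct skip_node`, `MAX_LEVEL` and `skip_search` are names chosen for this example. A concurrent version would load the `next` pointers atomically.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LEVEL 3

struct skip_node {
    int key;
    struct skip_node *next[MAX_LEVEL];  /* next[i] is the level i+1 link */
};

/* head[i] points to the first node on level i+1. */
static struct skip_node *skip_search(struct skip_node *head[MAX_LEVEL], int key)
{
    struct skip_node **links = head;

    for (int level = MAX_LEVEL - 1; level >= 0; ) {
        struct skip_node *n = links[level];

        if (n == NULL || n->key > key)
            level--;          /* would overshoot: drop down a level */
        else if (n->key == key)
            return n;         /* found */
        else
            links = n->next;  /* advance along this level */
    }
    return NULL;              /* not present */
}
```

For key 18 this visits 7 on level 3, refuses to jump to 21, moves to 13 on level 2, refuses 21 again, and finds 18 on level 1.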
Skiplists, the great
Replaces a lock/unlock pair over the entire skiplist with one atomic memory instruction per object level
• Insert without locking
• Search without locking, while inserting
• Forward & backward traversal without locking, while inserting
Skiplists, the good
• Simpler code than a Btree
  • WiredTiger binary search ~200 lines of code
  • a typical skiplist search < 20
• Fast search
  • a Btree guarantees search in logarithmic time
  • skiplists don’t offer a guarantee, but are usually close
Skiplists, the not-so-good
• Cache-unfriendly
  • every indirection a CPU cache miss
• Memory-unfriendly
  • needs more memory for a data set than a Btree
• Removal requires locking
  • WiredTiger is an MVCC engine (multiple values per key)
  • removal less important to WiredTiger
Ticket locks
• WiredTiger still needs to lock objects
  • but we can make locks faster
• Ticket locks
  • customers take a unique ticket number
  • customers served in ticket order
Ticket locks
• Two incrementing counters:
  • ticket: the next available ticket number
  • serving: the ticket number now being served
• Thread takes a ticket number
• Thread increments “next available”
• Thread waits for “serving” to match its ticket number
• When thread finishes, increments “serving”
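A minimal exclusive ticket lock along these lines, sketched with C11 atomics. `struct ticket_lock`, `ticket_acquire` and `ticket_release` are illustrative names, and the busy-wait loop is the simplest possible choice for a short example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct ticket_lock {
    _Atomic uint32_t ticket;   /* next available ticket number */
    _Atomic uint32_t serving;  /* ticket number now being served */
};

static void ticket_acquire(struct ticket_lock *l)
{
    /* Take a ticket and bump "next available" in one atomic step. */
    uint32_t mine = atomic_fetch_add(&l->ticket, 1);

    /* Spin until "serving" matches our ticket. */
    while (atomic_load_explicit(&l->serving, memory_order_acquire) != mine)
        ;
}

static void ticket_release(struct ticket_lock *l)
{
    /* Hand the lock to the holder of the next ticket. */
    atomic_fetch_add_explicit(&l->serving, 1, memory_order_release);
}
```

Threads are served strictly in ticket order, which is where the fairness comes from.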
Ticket locks are almost what we need
• Ticket locks avoid starvation and are “fair”
• Smaller memory footprint
• Can be made significantly faster than POSIX locks
  • remember our compare-and-swap instructions!
• But POSIX locks are shared between readers
Ticket locks: shared vs. exclusive
• Three incrementing counters:
  • ticket: the next available ticket number
  • readers: the next reader to be served
  • writers: the next writer to be served
Readers run in parallel
[Diagram: threads A, B and C alongside the Writers and Readers counters, with values running from 39 to 42]
Multiple variable updates without locking
• Updating multiple counters would require locking
  • ... but we can write the bus width atomically
• Encode the entire lock state in a single 8B value
struct lock {
    uint16_t readers;
    uint16_t writers;
    uint16_t ticket;    // 64K simultaneous threads
    uint16_t unused;
};
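One way the packed state could be driven with a single 64-bit compare-and-swap is sketched below. This is an assumption-laden illustration, not WiredTiger's implementation: the macros, `rwticket_t` and the four functions are invented here, only non-blocking "try" variants are shown, and the counters are packed with shifts rather than through the struct.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Three 16-bit counters in one 64-bit word; the top 16 bits are unused. */
#define READERS(v) ((uint16_t)(v))
#define WRITERS(v) ((uint16_t)((v) >> 16))
#define TICKET(v)  ((uint16_t)((v) >> 32))
#define PACK(r, w, t)                               \
    ((uint64_t)(uint16_t)(r) |                      \
     ((uint64_t)(uint16_t)(w) << 16) |              \
     ((uint64_t)(uint16_t)(t) << 32))

typedef _Atomic uint64_t rwticket_t;

/* A reader enters when its ticket equals "readers"; bumping "readers"
 * straight away lets the next reader in while this one is running. */
static bool try_read_lock(rwticket_t *l)
{
    uint64_t old = atomic_load(l);
    if (READERS(old) != TICKET(old))
        return false;
    uint64_t next = PACK(READERS(old) + 1, WRITERS(old), TICKET(old) + 1);
    return atomic_compare_exchange_strong(l, &old, next);
}

/* A finished reader bumps "writers" so a queued writer can proceed. */
static void read_unlock(rwticket_t *l)
{
    uint64_t old = atomic_load(l);
    while (!atomic_compare_exchange_weak(l, &old,
        PACK(READERS(old), WRITERS(old) + 1, TICKET(old))))
        ;
}

/* A writer enters only when its ticket equals "writers". */
static bool try_write_lock(rwticket_t *l)
{
    uint64_t old = atomic_load(l);
    if (WRITERS(old) != TICKET(old))
        return false;
    uint64_t next = PACK(READERS(old), WRITERS(old), TICKET(old) + 1);
    return atomic_compare_exchange_strong(l, &old, next);
}

/* A finished writer releases both the next reader and the next writer. */
static void write_unlock(rwticket_t *l)
{
    uint64_t old = atomic_load(l);
    while (!atomic_compare_exchange_weak(l, &old,
        PACK(READERS(old) + 1, WRITERS(old) + 1, TICKET(old))))
        ;
}
```

Every state change is one CAS on the whole word, so the three counters can never be observed out of step with each other.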
That’s a (very) fast introduction....
• Hazard pointers
• Skiplists
• Ticket locks
Open Source implementations are available in WiredTiger, including Public Domain ticket locks.
WiredTiger distribution
• Standalone application database toolkit library
  • key-value store (NoSQL)
  • row-store, column-store and LSM engines
  • schema layer includes data types and indexes
• Another MongoDB Open Source contribution
  • WiredTiger available for other applications
  • https://github.com/wiredtiger
Thank you!
Keith Bostic