Vadim Tkachenko, Percona, April 2016
Percona Fractal Tree / TokuDB
Agenda
Why a new data structure
Fractal Tree & LSM tree
Internals of Fractal Tree
When it is useful
How to use it
Why a new data structure
Before, it was the B-Tree
• "Traditional" data structure
• In the field since the 1970s
When B-Tree is good
• When the data size doesn't exceed memory limits
• When the application mostly performs read (SELECT) operations, or when read performance is more important than write performance
When B-Tree is not good
• As soon as the data size exceeds available memory, performance drops rapidly
• Choosing flash-based storage helps performance, but only to a certain extent: in the long run, memory limits still cause performance to suffer
To summarize
• B-tree was designed to provide optimal data retrieval performance, but not data updates (insert, delete, update)
• This shortcoming created a need for data structures that provide better performance for data storage.
Cases when B-Tree is not optimal
• accepting and storing event logs
• storing measurements from a high-frequency sensor
• tracking user clicks, and so on
• For such cases, two new data structures were created: log structured merge (LSM) trees and Fractal Trees®.
LSM & Fractal Tree
LSM tree & Fractal tree
• Shift balance from optimal reads toward faster writes
Fractal Trees
Fractal Trees
• Invented ~2007
• Developed by Tokutek; TokuDB as the commercial engine
• 2015: became part of Percona
Fractal Tree
• Delay writes (send messages instead)
• Combine multiple delayed writes into a single IO
• => SELECTs have much more work to do
• They must walk through all pending messages
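The bullets above can be sketched in a few lines of Python. This is a toy illustration of the write path only, not the PerconaFT implementation: writes become messages buffered at a non-leaf node, and a full buffer is pushed down to the children in one batch, which is how many delayed writes become a single IO. The `BUFFER_LIMIT` of 4 messages is a made-up number; real buffers are sized in bytes.

```python
BUFFER_LIMIT = 4  # hypothetical; real message buffers are sized in bytes

class Node:
    def __init__(self, pivots=None, children=None):
        self.pivots = pivots or []      # keys separating the children
        self.children = children or []  # child nodes (empty => leaf)
        self.buffer = []                # pending messages (non-leaf only)
        self.rows = {}                  # materialized rows (leaf only)
        self.flushes = 0                # count of batch flushes ("IOs")

    def is_leaf(self):
        return not self.children

    def child_for(self, key):
        # route a key to the child whose key range contains it
        for i, pivot in enumerate(self.pivots):
            if key < pivot:
                return self.children[i]
        return self.children[-1]

    def put(self, key, value):
        if self.is_leaf():
            self.rows[key] = value
            return
        # the write is only buffered here: it is "delayed"
        self.buffer.append((key, value))
        if len(self.buffer) >= BUFFER_LIMIT:
            self.flush()

    def flush(self):
        # one batch flush pushes all buffered messages down at once,
        # instead of paying one random IO per individual write
        self.flushes += 1
        pending, self.buffer = self.buffer, []
        for key, value in pending:
            self.child_for(key).put(key, value)

root = Node(pivots=[100], children=[Node(), Node()])
for k in range(8):
    root.put(k * 30, "row-%d" % k)

# 8 logical writes reached the leaves in only 2 batch flushes
print(root.flushes)
```

The same buffering is what makes reads slower: a SELECT must check every buffer on its root-to-leaf path, which a later slide returns to.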
Fractal tree benefits
• Tables that have a lot of indexes (preferably non-unique indexes)
• Heavy write workload into these tables
• Systems with slow storage
• Saving space when the environment's storage is fast but expensive
From idea to reality
• Need concurrency-control mechanisms
• Need crash safety
• Need transactions, logging + recovery
• Need to support multithreading
• Need to integrate with the MySQL API layer
• Not everything is perfect yet
Fractal Tree Internals
On MySQL Level:
CREATE TABLE metrics (
  ts timestamp,
  device_id int,
  metric_id int,
  cnt int,
  val double,
  PRIMARY KEY (ts, device_id, metric_id),
  KEY metric_id (metric_id, ts),
  KEY device_id (device_id, ts)
)
Internally 3 trees
• Primary Key (ts, device_id, metric_id) => data
• Key (metric_id, ts) => PK (ts, device_id, metric_id)
• Key (device_id, ts) => PK (ts, device_id, metric_id)
• Notice: a long PK adds overhead, since it is stored in every secondary index entry
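A toy model of these three trees makes both points visible — the assumption here is that plain Python dicts stand in for the fractal trees. Each secondary index stores the full primary key as its value, which is why a long PK inflates every secondary index, and a read through a secondary index always needs two lookups.

```python
# Toy model of the three trees behind the `metrics` table.
primary = {}    # (ts, device_id, metric_id) -> row data
by_metric = {}  # (metric_id, ts) -> PK
by_device = {}  # (device_id, ts) -> PK

def insert(ts, device_id, metric_id, cnt, val):
    pk = (ts, device_id, metric_id)
    primary[pk] = {"cnt": cnt, "val": val}
    by_metric[(metric_id, ts)] = pk   # the whole PK is duplicated here
    by_device[(device_id, ts)] = pk   # ...and here

insert(1000, 7, 42, cnt=1, val=3.14)

# A read through a secondary index needs two lookups:
pk = by_metric[(42, 1000)]   # lookup 1: secondary index -> PK
row = primary[pk]            # lookup 2: PK -> row
print(row["val"])
```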
Root Node
F = tokudb_fanout (default 16)
tokudb_block_size (default 4MB)
Basement node (leaf)
• tokudb_read_block_size (default 64KB)
• Chunk used for compression/decompression
• Smaller size is better for point lookups
Shape your tree (settings per TABLE)
• tokudb_block_size (default 4MiB)
  • size of a node in memory (on disk it will be compressed)
• tokudb_read_block_size (default 64KiB)
  • size of a basement node: the minimal reading block size, and also the block size for compression
  • Balance: a smaller tokudb_read_block_size is better for point reads, but leads to more random IO
• tokudb_fanout (default 16): defines the maximal number of children per non-leaf node (number of pivots = tokudb_fanout - 1)
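A back-of-envelope sketch of how these settings shape the tree, under the simplifying assumption that every node is full (`height_for` and the one-million-leaf figure are illustrative, not from the talk):

```python
import math

tokudb_block_size = 4 * 1024 * 1024   # node size in memory, default 4MiB
tokudb_read_block_size = 64 * 1024    # basement node size, default 64KiB
tokudb_fanout = 16                    # max children per non-leaf node

# each leaf node is split into this many compression chunks
basement_nodes_per_leaf = tokudb_block_size // tokudb_read_block_size

# a node with F children needs F - 1 pivot keys
pivots_per_nonleaf = tokudb_fanout - 1

def height_for(total_leaves, fanout):
    # non-leaf levels needed to address total_leaves full leaves
    return math.ceil(math.log(total_leaves, fanout))

print(basement_nodes_per_leaf)          # 64 basement nodes per leaf
print(pivots_per_nonleaf)               # 15
print(height_for(1_000_000, tokudb_fanout))  # 5 levels for 1M leaves
```

A larger fanout or block size makes the tree shallower but each flush and read heavier, which is the trade-off the recommendations on the next slides are about.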
Recommendations
tokudb_block_size:
A 4MiB block size is good for spinning disks.
For SSDs a smaller block size may be beneficial; I often use 1MiB.
In principle 64-128KiB should be even better, but TokuDB does not handle such sizes properly (performance bug: linear search for a free block in fragmented storage).
Recommendations
tokudb_read_block_size:
Recommended to set to 16KiB if you expect point queries (again, too bad this setting is per-table, not per-index)
How to see the shape of the tree
tokuftdump --summary
tokuftdump --summary
leaf nodes: 6797
non-leaf nodes: 97
Leaf size: 4,278,632,448
Total size: 4,286,052,352
Total uncompressed size: 6,231,518,882
Messages count: 70155
Messages size: 10,535,155
Records count: 30000000
Tree height: 2
height: 0, nodes count: 6797; avg children/node: 59.364131; basement nodes: 403498; msg size: 0; disksize: 4,278,632,448; uncompressed size: 6,220,381,082; ratio: 1.453825
height: 1, nodes count: 96; avg children/node: 70.802083; msg cnt: 65001; msg size: 9,756,907; disksize: 6,907,904; uncompressed size: 10,334,469; ratio: 1.496035
height: 2, nodes count: 1; avg children/node: 96.000000; msg cnt: 5154; msg size: 778,248; disksize: 512,000; uncompressed size: 803,331; ratio: 1.569006
FT properties
• "Delay writes" for as long as possible =>
  • writes are amortized into one big write instead of N random writes
  • may result in a serious liability: a huge number of messages not yet merged into leaf nodes
  • a SELECT will require traversing all these messages
  • especially bad for point SELECT queries
• Remember: Primary Key and Unique Key constraints REQUIRE a HIDDEN POINT SELECT lookup
• UNIQUE KEY: a performance killer for TokuDB
• Non-sequential PRIMARY KEY: a performance killer for TokuDB
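Why a point read is expensive can be shown with a toy read path (an illustration, not the PerconaFT code): a GET must walk root-to-leaf and scan every pending message buffer along the way, because a buffered message higher in the tree is newer than the copy in the leaf. A UNIQUE or PRIMARY KEY insert triggers exactly such a hidden point read to check for duplicates.

```python
class Node:
    def __init__(self, children=None):
        self.children = children or []
        self.buffer = []   # pending (key, value) messages, newest last
        self.rows = {}     # materialized rows (leaf only)

def get(node, key, buffers_scanned=0):
    # a pending message at this level is newer than anything below it
    hit = None
    for k, v in node.buffer:
        if k == key:
            hit = v
    buffers_scanned += 1
    if not node.children:
        value = hit if hit is not None else node.rows.get(key)
        return value, buffers_scanned
    # a single-child chain keeps the sketch short; a real tree routes by pivots
    below, buffers_scanned = get(node.children[0], key, buffers_scanned)
    return (hit if hit is not None else below), buffers_scanned

leaf = Node(); leaf.rows["a"] = 1           # old value, already in the leaf
mid = Node([leaf]); mid.buffer.append(("a", 2))  # newer, still buffered
root = Node([mid]); root.buffer.append(("b", 9)) # unrelated pending write

value, scanned = get(root, "a")
print(value, scanned)   # the newer buffered value wins; all 3 buffers scanned
```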
Implications of slow SELECTs
• Unique keys: background checks, implicit reads
• Foreign keys: background checks (not supported in TokuDB)
• SELECT by secondary index: requires two lookups
Covering indexes
• SELECT user_name FROM users WHERE user_email = '[email protected]'
• Instead of INDEX (user_email) =>
• INDEX (user_email, user_name)
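In the toy-dict model, the covering index looks like this (the email and name values are made up for the example): when the index entry itself carries user_name, the query is answered with one lookup instead of an index probe plus a primary-key probe.

```python
primary = {1: {"user_email": "a@x", "user_name": "Ann"}}  # PK -> row
idx_email = {"a@x": 1}                                    # email -> PK
idx_email_name = {"a@x": ("Ann", 1)}                      # email -> (name, PK)

# Non-covering INDEX (user_email): two lookups
name_slow = primary[idx_email["a@x"]]["user_name"]

# Covering INDEX (user_email, user_name): the answer is in the index
name_fast = idx_email_name["a@x"][0]

print(name_slow, name_fast)
```

The price is a larger index, but for TokuDB that avoided second point lookup matters more than usual, given how expensive point reads are.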
When to use Fractal Tree?
• Tables with many indexes (better if not UNIQUE) and intensive writes into these tables
• Slow storage
• Saving space on fast, expensive storage
• Less write amplification (good for SSD health)
• Cloud instances are often a good fit: storage is either slow, or expensive when fast
Benchmarks
Stories on PerconaFT internals
Eviction
• The algorithm that keeps cached nodes within the memory limit
Eviction
• tokudb_cache_size: the amount of memory TokuDB allocates for nodes in memory
• TokuDB's term is "CACHETABLE"; see the status variables:
  show global status like '%CACHETABLE%';
• Eviction: a background process that keeps memory consumption <= tokudb_cache_size
  • It starts only when size_of(nodes_in_memory) > tokudb_cache_size
• TokuDB will use more memory than tokudb_cache_size
  • A user thread will be stalled if used memory > tokudb_cache_size * 1.2
Eviction algorithm
The CACHETABLE uses a GCLOCK algorithm (not LRU) to manage nodes in memory.
Eviction algorithm in simple steps:
• If size_of(nodes_in_memory) > tokudb_cache_size:
  • Find a victim to remove from memory
  • The node with the smallest access count is removed (evicted)
  • If the node is DIRTY, it is sent to a background process to be written to disk
  • Tokudb_CACHETABLE_SIZE_WRITING: size of nodes in the background write queue
• Potential memory consumption is tokudb_cache_size + Tokudb_CACHETABLE_SIZE_WRITING
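The steps above can be sketched as a small simulation. This is a simplification of GCLOCK as the slide describes it (integer access counts, a min-count victim, a cache limit counted in nodes); the real CACHETABLE tracks bytes and sweeps counters rather than scanning for a global minimum.

```python
class CacheTable:
    def __init__(self, cache_size):
        self.cache_size = cache_size  # limit, here counted in nodes
        self.nodes = {}               # name -> {"count": int, "dirty": bool}
        self.write_queue = []         # dirty victims for the background writer

    def touch(self, name, dirty=False):
        node = self.nodes.setdefault(name, {"count": 0, "dirty": False})
        node["count"] += 1
        node["dirty"] = node["dirty"] or dirty
        self.maybe_evict()

    def maybe_evict(self):
        # eviction runs only once the cache is over the limit
        while len(self.nodes) > self.cache_size:
            # victim = node with the smallest access count
            victim = min(self.nodes, key=lambda n: self.nodes[n]["count"])
            node = self.nodes.pop(victim)
            if node["dirty"]:
                # dirty nodes go to a background write
                # (Tokudb_CACHETABLE_SIZE_WRITING accounts for these)
                self.write_queue.append(victim)

ct = CacheTable(cache_size=2)
ct.touch("A"); ct.touch("A")   # A: access count 2
ct.touch("B", dirty=True)      # B: access count 1, dirty
ct.touch("C")                  # over limit -> B (lowest count) is evicted
print(sorted(ct.nodes), ct.write_queue)
```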
Partial eviction
• For non-leaf, non-dirty nodes, the evictor may choose to perform partial eviction
• Two stages of partial eviction:
  • Compress a part of the node
  • If it is still unused, remove it from memory
• Variables that control this:
  • tokudb_enable_partial_eviction
  • tokudb_compress_buffers_before_eviction
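The two stages can be written as a tiny state machine (the state names and the dict-based node are assumptions for illustration; the real evictor operates on internal node buffers):

```python
FULL, COMPRESSED, EVICTED = "full", "compressed", "evicted"

def partial_evict_step(node):
    # stage 1: compress part of a non-leaf, non-dirty node
    if node["state"] == FULL and not node["dirty"] and not node["leaf"]:
        node["state"] = COMPRESSED
    # stage 2: if it still has not been used, drop it from memory
    elif node["state"] == COMPRESSED and not node["recently_used"]:
        node["state"] = EVICTED
    return node

node = {"state": FULL, "dirty": False, "leaf": False, "recently_used": False}
partial_evict_step(node)   # first pass: compressed
partial_evict_step(node)   # still unused: evicted
print(node["state"])
```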
TokuDB Compression
• Only uncompressed data is stored in memory (except the partially compressed parts of non-leaf nodes)
• It can be beneficial to use the OS cache as a secondary cache for compressed nodes; for this:
  • tokudb_directio=OFF
  • use cgroups to limit total memory usage by the mysqld process
Checkpointing
• Checkpointing is the periodic process of getting datafiles in sync with the transactional redo log files
• show global status like '%CHECKPOINT%';
• In TokuDB checkpointing is time-based; in InnoDB it is based on log file size
• In InnoDB checkpointing is fuzzy; in TokuDB it starts on a timer and runs until it is done
• Checkpointing interval in TokuDB:
  • tokudb_checkpointing_period=N sec
Checkpoint algorithm
• START CHECKPOINT:
  • begin_checkpoint; <- all transactions are stalled
  • mark all nodes in memory as PENDING
  • end_begin_checkpoint;
• Checkpoint thread: goes through all PENDING nodes; if a node is dirty, writes it to disk
• User threads: if a user query encounters a PENDING node, the node is CLONED and put into the background checkpoint thread pool
• By default the checkpoint thread pool size (number of threads) = CPU cores / 4
  • That is 4 threads on a 16-core server
  • In a CPU-bound workload it takes 25% of CPU power away from user threads!
  • Variable: tokudb_checkpoint_pool_threads=N
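The PENDING/clone interplay can be sketched with toy objects (the class and method names are illustrative; in the real engine only begin_checkpoint stalls transactions, and the clone is handed to the checkpoint thread pool):

```python
class Checkpoint:
    def __init__(self, nodes):
        self.nodes = nodes     # name -> {"dirty": bool, maybe "pending": bool}
        self.written = []      # nodes the checkpoint thread flushed to disk
        self.cloned = []       # clones handed over by user threads

    def begin(self):
        # begin_checkpoint: mark every in-memory node as PENDING
        for node in self.nodes.values():
            node["pending"] = True

    def background_write(self, name):
        # checkpoint thread: write dirty PENDING nodes, clear the flag
        node = self.nodes[name]
        if node.pop("pending", False) and node["dirty"]:
            node["dirty"] = False
            self.written.append(name)

    def user_touch(self, name):
        # user thread hits a PENDING node: clone it for the checkpoint
        # pool so the user query is not blocked on the flush
        node = self.nodes[name]
        if node.get("pending"):
            self.cloned.append(dict(node))
            node["pending"] = False

cp = Checkpoint({"A": {"dirty": True}, "B": {"dirty": False}})
cp.begin()
cp.user_touch("A")         # user query wins: dirty node A is cloned
cp.background_write("B")   # clean node: nothing to write
print(len(cp.cloned), cp.written)
```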
Few words on LSM
LSM tree
• Older than the Fractal Tree
• Google BigTable was the primary driver of interest
• Cassandra
• RocksDB
• MongoRocks
• MyRocks
Instead of final summary
• Alternative data structures have their place
• Use them wisely; know their limitations
• A lot of work ahead