Vadim Tkachenko, Percona, April 2016
Percona Fractal Tree / TokuDB
Agenda
Why a new data structure
Fractal Tree & LSM tree
Internals of Fractal Tree
When it is useful
How to use it
Why a new data structure
Before, it was the B-Tree
• "Traditional" data structure
• In the field since the 1970s
When B-Tree is good
• When the data size doesn't exceed memory limits
• When the application mostly performs read (SELECT) operations, or when read performance is more important than write performance
When B-Tree is not good
• As soon as the data size exceeds available memory, performance drops rapidly
• Choosing flash-based storage helps performance, but only to a certain extent: in the long run, memory limits still cause performance to suffer
To summarize
• B-tree was designed to provide optimal data retrieval performance, but not data updates (insert, delete, update)
• This shortcoming created a need for data structures that provide better performance for data storage.
Cases when B-Tree is not optimal
• accepting and storing event logs
• storing measurements from a high-frequency sensor
• tracking user clicks, and so on
• For such cases, two new data structures were created: log structured merge (LSM) trees and Fractal Trees®.
LSM & Fractal Tree
LSM tree & Fractal tree
• Shift balance from optimal reads toward faster writes
Fractal Trees
Fractal Trees
• Invented ~2007
• Developed by Tokutek; TokuDB as the commercial engine
• 2015: became part of Percona
Fractal Tree
• Delay writes (send messages instead)
• Combine multiple delayed writes into a single IO
• => SELECTs have much more work to do
• They must walk through all pending messages
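The bullets above can be sketched in a few lines of Python. This is a toy illustration of the write path only, not the PerconaFT implementation: writes become messages buffered at a non-leaf node, and a full buffer is pushed down to the children in one batch, which is how many delayed writes become a single IO. The `BUFFER_LIMIT` of 4 messages is a made-up number; real buffers are sized in bytes.

```python
BUFFER_LIMIT = 4  # hypothetical; real message buffers are sized in bytes

class Node:
    def __init__(self, pivots=None, children=None):
        self.pivots = pivots or []      # keys separating the children
        self.children = children or []  # child nodes (empty => leaf)
        self.buffer = []                # pending messages (non-leaf only)
        self.rows = {}                  # materialized rows (leaf only)
        self.flushes = 0                # count of batch flushes ("IOs")

    def is_leaf(self):
        return not self.children

    def child_for(self, key):
        # route a key to the child whose key range contains it
        for i, pivot in enumerate(self.pivots):
            if key < pivot:
                return self.children[i]
        return self.children[-1]

    def put(self, key, value):
        if self.is_leaf():
            self.rows[key] = value
            return
        # the write is only buffered here: it is "delayed"
        self.buffer.append((key, value))
        if len(self.buffer) >= BUFFER_LIMIT:
            self.flush()

    def flush(self):
        # one batch flush pushes all buffered messages down at once,
        # instead of paying one random IO per individual write
        self.flushes += 1
        pending, self.buffer = self.buffer, []
        for key, value in pending:
            self.child_for(key).put(key, value)

root = Node(pivots=[100], children=[Node(), Node()])
for k in range(8):
    root.put(k * 30, "row-%d" % k)

# 8 logical writes reached the leaves in only 2 batch flushes
print(root.flushes)
```

The same buffering is what makes reads slower: a SELECT must check every buffer on its root-to-leaf path, which a later slide returns to.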
Fractal tree benefits
• Tables that have a lot of indexes (preferably non-unique indexes)
• Heavy write workload into these tables
• Systems with slow storage
• Saving space when the environment's storage is fast but expensive
From idea to reality
• Need concurrency-control mechanisms
• Need crash safety
• Need transactions, logging + recovery
• Need to support multithreading
• Need to integrate with the MySQL API layer
• Not everything is perfect yet
Fractal Tree Internals
On MySQL Level:
CREATE TABLE metrics (
  ts timestamp,
  device_id int,
  metric_id int,
  cnt int,
  val double,
  PRIMARY KEY (ts, device_id, metric_id),
  KEY metric_id (metric_id, ts),
  KEY device_id (device_id, ts)
)
Internally 3 trees
• Primary Key (ts, device_id, metric_id) => data
• Key (metric_id, ts) => PK (ts, device_id, metric_id)
• Key (device_id, ts) => PK (ts, device_id, metric_id)
• Notice: a long PK adds overhead, since it is stored in every secondary index entry
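A toy model of these three trees makes both points visible — the assumption here is that plain Python dicts stand in for the fractal trees. Each secondary index stores the full primary key as its value, which is why a long PK inflates every secondary index, and a read through a secondary index always needs two lookups.

```python
# Toy model of the three trees behind the `metrics` table.
primary = {}    # (ts, device_id, metric_id) -> row data
by_metric = {}  # (metric_id, ts) -> PK
by_device = {}  # (device_id, ts) -> PK

def insert(ts, device_id, metric_id, cnt, val):
    pk = (ts, device_id, metric_id)
    primary[pk] = {"cnt": cnt, "val": val}
    by_metric[(metric_id, ts)] = pk   # the whole PK is duplicated here
    by_device[(device_id, ts)] = pk   # ...and here

insert(1000, 7, 42, cnt=1, val=3.14)

# A read through a secondary index needs two lookups:
pk = by_metric[(42, 1000)]   # lookup 1: secondary index -> PK
row = primary[pk]            # lookup 2: PK -> row
print(row["val"])
```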
Root Node
F = tokudb_fanout (default 16)
tokudb_block_size (default 4MB)
Basement node (leaf)
• tokudb_read_block_size (default 64KB)
• Chunk used for compression/decompression
• Smaller size is better for point lookups
Shape your tree (settings per TABLE)
• tokudb_block_size (default 4MiB)
  • size of a node in memory (on disk it will be compressed)
• tokudb_read_block_size (default 64KiB)
  • size of a basement node: the minimal reading block size, and also the block size for compression
  • Balance: a smaller tokudb_read_block_size is better for point reads, but leads to more random IO
• tokudb_fanout (default 16): defines the maximal number of children per non-leaf node (number of pivots = tokudb_fanout - 1)
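A back-of-envelope sketch of how these settings shape the tree, under the simplifying assumption that every node is full (`height_for` and the one-million-leaf figure are illustrative, not from the talk):

```python
import math

tokudb_block_size = 4 * 1024 * 1024   # node size in memory, default 4MiB
tokudb_read_block_size = 64 * 1024    # basement node size, default 64KiB
tokudb_fanout = 16                    # max children per non-leaf node

# each leaf node is split into this many compression chunks
basement_nodes_per_leaf = tokudb_block_size // tokudb_read_block_size

# a node with F children needs F - 1 pivot keys
pivots_per_nonleaf = tokudb_fanout - 1

def height_for(total_leaves, fanout):
    # non-leaf levels needed to address total_leaves full leaves
    return math.ceil(math.log(total_leaves, fanout))

print(basement_nodes_per_leaf)          # 64 basement nodes per leaf
print(pivots_per_nonleaf)               # 15
print(height_for(1_000_000, tokudb_fanout))  # 5 levels for 1M leaves
```

A larger fanout or block size makes the tree shallower but each flush and read heavier, which is the trade-off the recommendations on the next slides are about.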
Recommendations
tokudb_block_size:
A 4MiB block size is good for spinning disks.
For SSDs a smaller block size may be beneficial; I often use 1MiB.
In principle 64-128KiB should be even better, but TokuDB does not handle such sizes properly (performance bug: linear search for a free block in fragmented storage).
Recommendations
tokudb_read_block_size:
Recommended to set to 16KiB if you expect point queries (again, too bad this setting is per-table, not per-index)
How to see the shape of the tree
tokuftdump --summary
tokuftdump --summary
leaf nodes: 6797
non-leaf nodes: 97
Leaf size: 4,278,632,448
Total size: 4,286,052,352
Total uncompressed size: 6,231,518,882
Messages count: 70155
Messages size: 10,535,155
Records count: 30000000
Tree height: 2
height: 0, nodes count: 6797; avg children/node: 59.364131; basement nodes: 403498; msg size: 0; disksize: 4,278,632,448; uncompressed size: 6,220,381,082; ratio: 1.453825
height: 1, nodes count: 96; avg children/node: 70.802083; msg cnt: 65001; msg size: 9,756,907; disksize: 6,907,904; uncompressed size: 10,334,469; ratio: 1.496035
height: 2, nodes count: 1; avg children/node: 96.000000; msg cnt: 5154; msg size: 778,248; disksize: 512,000; uncompressed size: 803,331; ratio: 1.569006
FT properties
• "Delay writes" for as long as possible =>
  • writes are amortized into one big write instead of N random writes
  • may result in a serious liability: a huge number of messages not yet merged into leaf nodes
  • a SELECT will require traversing all these messages
  • especially bad for point SELECT queries
• Remember: Primary Key and Unique Key constraints REQUIRE a HIDDEN POINT SELECT lookup
• UNIQUE KEY: a performance killer for TokuDB
• Non-sequential PRIMARY KEY: a performance killer for TokuDB
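Why a point read is expensive can be shown with a toy read path (an illustration, not the PerconaFT code): a GET must walk root-to-leaf and scan every pending message buffer along the way, because a buffered message higher in the tree is newer than the copy in the leaf. A UNIQUE or PRIMARY KEY insert triggers exactly such a hidden point read to check for duplicates.

```python
class Node:
    def __init__(self, children=None):
        self.children = children or []
        self.buffer = []   # pending (key, value) messages, newest last
        self.rows = {}     # materialized rows (leaf only)

def get(node, key, buffers_scanned=0):
    # a pending message at this level is newer than anything below it
    hit = None
    for k, v in node.buffer:
        if k == key:
            hit = v
    buffers_scanned += 1
    if not node.children:
        value = hit if hit is not None else node.rows.get(key)
        return value, buffers_scanned
    # a single-child chain keeps the sketch short; a real tree routes by pivots
    below, buffers_scanned = get(node.children[0], key, buffers_scanned)
    return (hit if hit is not None else below), buffers_scanned

leaf = Node(); leaf.rows["a"] = 1           # old value, already in the leaf
mid = Node([leaf]); mid.buffer.append(("a", 2))  # newer, still buffered
root = Node([mid]); root.buffer.append(("b", 9)) # unrelated pending write

value, scanned = get(root, "a")
print(value, scanned)   # the newer buffered value wins; all 3 buffers scanned
```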
Implications of slow SELECTs
• Unique keys: background checks, implicit reads
• Foreign keys: background checks (not supported in TokuDB)
• SELECT by secondary index: requires two lookups
Covering indexes
• SELECT user_name FROM users WHERE user_email = '[email protected]'
• Instead of INDEX (user_email) =>
• INDEX (user_email, user_name)
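In the toy-dict model, the covering index looks like this (the email and name values are made up for the example): when the index entry itself carries user_name, the query is answered with one lookup instead of an index probe plus a primary-key probe.

```python
primary = {1: {"user_email": "a@x", "user_name": "Ann"}}  # PK -> row
idx_email = {"a@x": 1}                                    # email -> PK
idx_email_name = {"a@x": ("Ann", 1)}                      # email -> (name, PK)

# Non-covering INDEX (user_email): two lookups
name_slow = primary[idx_email["a@x"]]["user_name"]

# Covering INDEX (user_email, user_name): the answer is in the index
name_fast = idx_email_name["a@x"][0]

print(name_slow, name_fast)
```

The price is a larger index, but for TokuDB that avoided second point lookup matters more than usual, given how expensive point reads are.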
When to use Fractal Tree?
• Tables with many indexes (better if not UNIQUE) and intensive writes into these tables
• Slow storage
• Saving space on fast, expensive storage
• Less write amplification (good for SSD health)
• Cloud instances are often a good fit: storage is either slow, or expensive when fast
Benchmarks
Stories on PerconaFT internals
Eviction
• The algorithm that keeps cached nodes within the memory limit
Eviction
• tokudb_cache_size: the amount of memory TokuDB allocates for nodes in memory
• TokuDB's term is "CACHETABLE"; see the status variables:
  show global status like '%CACHETABLE%';
• Eviction: a background process that keeps memory consumption <= tokudb_cache_size
  • It starts only when size_of(nodes_in_memory) > tokudb_cache_size
• TokuDB will use more memory than tokudb_cache_size
  • A user thread will be stalled if used memory > tokudb_cache_size * 1.2
Eviction algorithm
The CACHETABLE uses a GCLOCK algorithm (not LRU) to manage nodes in memory.
Eviction algorithm in simple steps:
• If size_of(nodes_in_memory) > tokudb_cache_size:
  • Find a victim to remove from memory
  • The node with the smallest access count is removed (evicted)
  • If the node is DIRTY, it is sent to a background process to be written to disk
  • Tokudb_CACHETABLE_SIZE_WRITING: size of nodes in the background write queue
• Potential memory consumption is tokudb_cache_size + Tokudb_CACHETABLE_SIZE_WRITING
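The steps above can be sketched as a small simulation. This is a simplification of GCLOCK as the slide describes it (integer access counts, a min-count victim, a cache limit counted in nodes); the real CACHETABLE tracks bytes and sweeps counters rather than scanning for a global minimum.

```python
class CacheTable:
    def __init__(self, cache_size):
        self.cache_size = cache_size  # limit, here counted in nodes
        self.nodes = {}               # name -> {"count": int, "dirty": bool}
        self.write_queue = []         # dirty victims for the background writer

    def touch(self, name, dirty=False):
        node = self.nodes.setdefault(name, {"count": 0, "dirty": False})
        node["count"] += 1
        node["dirty"] = node["dirty"] or dirty
        self.maybe_evict()

    def maybe_evict(self):
        # eviction runs only once the cache is over the limit
        while len(self.nodes) > self.cache_size:
            # victim = node with the smallest access count
            victim = min(self.nodes, key=lambda n: self.nodes[n]["count"])
            node = self.nodes.pop(victim)
            if node["dirty"]:
                # dirty nodes go to a background write
                # (Tokudb_CACHETABLE_SIZE_WRITING accounts for these)
                self.write_queue.append(victim)

ct = CacheTable(cache_size=2)
ct.touch("A"); ct.touch("A")   # A: access count 2
ct.touch("B", dirty=True)      # B: access count 1, dirty
ct.touch("C")                  # over limit -> B (lowest count) is evicted
print(sorted(ct.nodes), ct.write_queue)
```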
Partial eviction
• For non-leaf, non-dirty nodes, the evictor may choose to perform partial eviction
• Two stages of partial eviction:
  • Compress a part of the node
  • If it is still unused, remove it from memory
• Variables that control this:
  • tokudb_enable_partial_eviction
  • tokudb_compress_buffers_before_eviction
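The two stages can be written as a tiny state machine (the state names and the dict-based node are assumptions for illustration; the real evictor operates on internal node buffers):

```python
FULL, COMPRESSED, EVICTED = "full", "compressed", "evicted"

def partial_evict_step(node):
    # stage 1: compress part of a non-leaf, non-dirty node
    if node["state"] == FULL and not node["dirty"] and not node["leaf"]:
        node["state"] = COMPRESSED
    # stage 2: if it still has not been used, drop it from memory
    elif node["state"] == COMPRESSED and not node["recently_used"]:
        node["state"] = EVICTED
    return node

node = {"state": FULL, "dirty": False, "leaf": False, "recently_used": False}
partial_evict_step(node)   # first pass: compressed
partial_evict_step(node)   # still unused: evicted
print(node["state"])
```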
TokuDB Compression
• Only uncompressed data is stored in memory (except the partially compressed parts of non-leaf nodes)
• It can be beneficial to use the OS cache as a secondary cache for compressed nodes; for this:
  • tokudb_directio=OFF
  • use cgroups to limit total memory usage by the mysqld process
Checkpointing
• Checkpointing is the periodic process of getting datafiles in sync with the transactional redo log files
• show global status like '%CHECKPOINT%';
• In TokuDB checkpointing is time-based; in InnoDB it is based on log file size
• In InnoDB checkpointing is fuzzy; in TokuDB it starts on a timer and runs until it is done
• Checkpointing interval in TokuDB:
  • tokudb_checkpointing_period=N sec
Checkpoint algorithm
• START CHECKPOINT:
  • begin_checkpoint; <- all transactions are stalled
  • mark all nodes in memory as PENDING
  • end_begin_checkpoint;
• Checkpoint thread: goes through all PENDING nodes; if a node is dirty, writes it to disk
• User threads: if a user query encounters a PENDING node, the node is CLONED and put into the background checkpoint thread pool
• By default the checkpoint thread pool size (number of threads) = CPU cores / 4
  • That is 4 threads on a 16-core server
  • In a CPU-bound workload it takes 25% of CPU power away from user threads!
  • Variable: tokudb_checkpoint_pool_threads=N
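The PENDING/clone interplay can be sketched with toy objects (the class and method names are illustrative; in the real engine only begin_checkpoint stalls transactions, and the clone is handed to the checkpoint thread pool):

```python
class Checkpoint:
    def __init__(self, nodes):
        self.nodes = nodes     # name -> {"dirty": bool, maybe "pending": bool}
        self.written = []      # nodes the checkpoint thread flushed to disk
        self.cloned = []       # clones handed over by user threads

    def begin(self):
        # begin_checkpoint: mark every in-memory node as PENDING
        for node in self.nodes.values():
            node["pending"] = True

    def background_write(self, name):
        # checkpoint thread: write dirty PENDING nodes, clear the flag
        node = self.nodes[name]
        if node.pop("pending", False) and node["dirty"]:
            node["dirty"] = False
            self.written.append(name)

    def user_touch(self, name):
        # user thread hits a PENDING node: clone it for the checkpoint
        # pool so the user query is not blocked on the flush
        node = self.nodes[name]
        if node.get("pending"):
            self.cloned.append(dict(node))
            node["pending"] = False

cp = Checkpoint({"A": {"dirty": True}, "B": {"dirty": False}})
cp.begin()
cp.user_touch("A")         # user query wins: dirty node A is cloned
cp.background_write("B")   # clean node: nothing to write
print(len(cp.cloned), cp.written)
```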
Few words on LSM
LSM tree
• Older than the Fractal Tree
• Google BigTable was the primary driver of interest
• Cassandra
• RocksDB
• MongoRocks
• MyRocks
Instead of final summary
• Alternative data structures have their place
• Use them wisely; know their limitations
• A lot of work ahead