© 2017 InfluxData. All rights reserved. 1
Inside the InfluxDB Storage Engine
Gianluca [email protected]
@gianarb
What is time series data?
Stock trades and quotes
Metrics
Analytics
Events
Sensor data
Traces
Two kinds of time series data…
Regular time series
t0 t1 t2 t3 t4 t6 t7
Samples at regular intervals
Irregular time series
t0 t1 t2 t3 t4 t6 t7
Events whenever they come in
Why would you want a database for time series data?
Scale
Example from server monitoring
• 2,000 servers, VMs, containers, or sensor units
• 1,000 measurements per server/unit
• every 10 seconds
• = 17,280,000,000 distinct points per day
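The arithmetic behind that total can be checked with a short sketch. The function and parameter names are illustrative assumptions, not anything from InfluxDB.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(units: int, measurements_per_unit: int, interval_seconds: int) -> int:
    """Points written per day when every measurement is sampled at a fixed interval."""
    samples_per_series = SECONDS_PER_DAY // interval_seconds  # 8,640 at 10 seconds
    return units * measurements_per_unit * samples_per_series


total = points_per_day(2_000, 1_000, 10)
print(f"{total:,}")
```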
Compression
Aging out data
Downsampling
Fast range queries
Two Databases…
TSDB
Inverted Index
preliminary intro materials…
Everything is indexed by time and series
Shards
10/10/2015 10/11/2015 10/12/2015 10/13/2015
Data organized into shards of time; each is an underlying DB
Efficient to drop old data
InfluxDB data
temperature,device=dev1,building=b1 internal=80,external=18 1443782126
Measurement: temperature
Tags (tagset all together): device=dev1,building=b1
Fields: internal=80,external=18
Timestamp: 1443782126
We actually store up to ns scale timestamps, but I couldn’t fit that on the slide
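As a rough illustration of how the example point splits into measurement, tags, fields, and timestamp, here is a minimal sketch. It assumes no escaped spaces or commas and numeric field values only, so it is not a full line protocol parser; `parse_point` is a hypothetical helper name.

```python
def parse_point(line: str):
    """Split one simplified point into (measurement, tags, fields, timestamp)."""
    series, fields_part, timestamp = line.split(" ")
    measurement, *tag_pairs = series.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    fields = {
        name: float(value)
        for name, value in (pair.split("=", 1) for pair in fields_part.split(","))
    }
    return measurement, tags, fields, int(timestamp)


point = parse_point(
    "temperature,device=dev1,building=b1 internal=80,external=18 1443782126"
)
print(point)
```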
Each series and field maps to a unique ID
temperature,device=dev1,building=b1#internal -> 1
temperature,device=dev1,building=b1#external -> 2
Data per ID is tuples ordered by time
temperature,device=dev1,building=b1#internal -> 1
temperature,device=dev1,building=b1#external -> 2
1 -> (1443782126,80)
2 -> (1443782126,18)
Arranging in Key/Value Stores
Key (ID,Time)    Value
1,1443782126     80
1,1443782127     81
2,1443782126     18
2,1443782130     17
2,1443782256     15
3,1443700126     18
New data lands in its sorted position: the key space is ordered
Many existing storage engines have this model
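A minimal sketch of why an ordered key space helps: with keys sorted by (ID, time), a time-range query for one series becomes a contiguous slice. `OrderedStore` is a hypothetical in-memory stand-in, not one of the storage engines the talk refers to.

```python
import bisect


class OrderedStore:
    """Toy key/value store whose keys are kept sorted by (series_id, timestamp)."""

    def __init__(self):
        self._keys = []    # sorted list of (series_id, timestamp)
        self._values = {}  # key -> value

    def put(self, series_id, timestamp, value):
        key = (series_id, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep the key space ordered
        self._values[key] = value

    def scan(self, series_id, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._keys, (series_id, start))
        hi = bisect.bisect_right(self._keys, (series_id, end))
        return [(ts, self._values[(sid, ts)]) for sid, ts in self._keys[lo:hi]]


store = OrderedStore()
for sid, ts, val in [
    (1, 1443782126, 80), (2, 1443782126, 18), (1, 1443782127, 81),
    (2, 1443782256, 15), (2, 1443782130, 17), (3, 1443700126, 18),
]:
    store.put(sid, ts, val)

print(store.scan(2, 1443782126, 1443782200))
```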
New Storage Engine?!
First we used LSM Trees
deletes expensive
too many open file handles
Then mmap COW B+Trees
write throughput
compression
met our requirements
High write throughput
Awesome read performance
Better Compression
Writes can’t block reads
Reads can’t block writes
Write multiple ranges simultaneously
Hot backups
Many databases open in a single process
Enter InfluxDB’s Time Structured Merge Tree (TSM Tree)
like LSM, but different
Components
WAL, in-memory cache, index files
Similar to LSM Trees:
• WAL: same
• In-memory cache: like MemTables
• Index files: like SSTables
awesome time series data
WAL (an append only file)
in memory index
on disk index (periodic flushes)
Memory mapped!
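The write path above can be sketched with in-memory stand-ins: append to the WAL, update the in-memory index, and periodically flush a sorted snapshot. `TinyEngine`, its attributes, and the point-count flush trigger are all assumptions for illustration; the real engine writes files and memory-maps them.

```python
class TinyEngine:
    """Toy write path: WAL append, in-memory index, periodic flush."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # stands in for the append-only file
        self.cache = {}      # in-memory index: key -> [(timestamp, value)]
        self.flushed = []    # stands in for immutable on-disk index files
        self.flush_threshold = flush_threshold
        self._cached_points = 0

    def write(self, key, timestamp, value):
        self.wal.append((key, timestamp, value))                    # durability first
        self.cache.setdefault(key, []).append((timestamp, value))   # then queryable
        self._cached_points += 1
        if self._cached_points >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Snapshot the cache with keys and points sorted, then start over.
        snapshot = {key: sorted(points) for key, points in sorted(self.cache.items())}
        self.flushed.append(snapshot)
        self.cache.clear()
        self.wal.clear()
        self._cached_points = 0


engine = TinyEngine(flush_threshold=3)
engine.write("cpu#idle", 2, 0.5)
engine.write("cpu#idle", 1, 0.4)
engine.write("mem#used", 1, 10.0)   # third point triggers a flush
engine.write("cpu#idle", 3, 0.6)    # stays in the WAL and cache for now
```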
TSM File
Compression
Timestamps: encoding based on precision and deltas
Timestamps (best case): Run length encoding
Deltas are all the same for a block
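A minimal sketch of the best case, assuming a block is just a Python list of integer timestamps: when every delta is identical, the block collapses to three numbers. The function names and the (first, delta, count) layout are illustrative assumptions, not the on-disk format.

```python
def rle_encode_timestamps(timestamps):
    """Return (first, delta, count) if all deltas are equal, else None."""
    if len(timestamps) < 2:
        return None
    deltas = {b - a for a, b in zip(timestamps, timestamps[1:])}
    if len(deltas) != 1:
        return None  # irregular block: fall back to another encoding
    return timestamps[0], deltas.pop(), len(timestamps)


def rle_decode_timestamps(first, delta, count):
    return [first + i * delta for i in range(count)]


regular = [1443782120 + 10 * i for i in range(1000)]
encoded = rle_encode_timestamps(regular)
print(encoded)
```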
Timestamps (good case): Simple8B
Anh and Moffat in "Index compression using 64-bit words"
Timestamps (worst case): raw values
nano-second timestamps with large deltas
float64: double delta
Facebook’s Gorilla (google: gorilla time series facebook)
https://github.com/dgryski/go-tsz
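The float scheme described in the Gorilla paper XORs each value's bit pattern with its predecessor's, so repeated or slowly changing values produce results that are zero or mostly zero bits. The sketch below shows only that XOR step, not the bit-level packing that follows it; the helper names are assumptions.

```python
import struct


def float_bits(x: float) -> int:
    """IEEE 754 bit pattern of a float64 as an unsigned 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_with_previous(values):
    """XOR each value's bits with the previous value's bits."""
    bits = [float_bits(v) for v in values]
    return [current ^ previous for previous, current in zip(bits, bits[1:])]


xors = xor_with_previous([80.0, 80.0, 80.5, 81.0])
for x in xors:
    # Long runs of leading zero bits are what make the result cheap to store.
    print(f"{x:064b}")
```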
booleans are bits!
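A minimal sketch of the idea, packing eight booleans per byte. The bit order (most significant bit first) and the need to carry the count separately are assumptions of this example, not a description of the TSM block format.

```python
def pack_bools(values):
    """Pack a list of booleans into bytes, most significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            out[i // 8] |= 1 << (7 - i % 8)
    return bytes(out)


def unpack_bools(data, count):
    """Recover the first `count` booleans from packed bytes."""
    return [bool(data[i // 8] & (1 << (7 - i % 8))) for i in range(count)]


flags = [True, False, True, True, False, False, False, False, True]
packed = pack_bools(flags)
print(packed.hex(), len(packed), "bytes for", len(flags), "values")
```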
int64 uses double delta, zig-zag
zig-zag same as from Protobufs
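Zig-zag encoding maps small negative and positive integers to small unsigned ones, which keeps deltas compact even when a series goes down as well as up. The sketch below pairs it with a single delta pass for brevity (the slide mentions double delta), and the function names are illustrative.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n: int) -> int:
    """Map a signed 64-bit integer to unsigned: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((n << 1) ^ (n >> 63)) & MASK64


def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)


def encode_int64(values):
    """Store the first value, then deltas, all zig-zag encoded."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    return [zigzag_encode(d) for d in deltas]


def decode_int64(encoded):
    out, total = [], 0
    for z in encoded:
        total += zigzag_decode(z)
        out.append(total)
    return out


values = [100, 98, 103, 103, 90]
encoded = encode_int64(values)
print(encoded)
```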
string uses Snappy
same compression LevelDB uses
(might add dictionary compression)
Updates
Write, resolve at query
Deletes
Tombstone, resolve at query & compaction
Compactions
• Combine multiple TSM files
• Put all series points into same file
• Series points in 1k blocks
• Multiple levels
• Full compaction when cold for writes
Example Query
select percentile(90, value) from cpu
where time > now() - 12h and "region" = 'west'
group by time(10m), host
How to map to series?
Inverted Index!
Inverted Index
series to ID:
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2
measurement to fields:
cpu -> [idle]
host to values:
host -> [A, B]
region to values:
region -> [west]
postings lists:
cpu -> [1, 2]
host=A -> [1]
host=B -> [2]
region=west -> [1, 2]
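A minimal sketch of how postings lists answer "which series match this measurement and these tags": build one list of series IDs per term, then intersect the lists for the terms in the query. `build_index` and `lookup` are hypothetical helpers, and IDs are simply assigned in input order.

```python
from collections import defaultdict


def build_index(series_keys):
    """Return (series key -> ID, term -> postings list of series IDs)."""
    series_ids = {}
    postings = defaultdict(list)
    for series_id, key in enumerate(series_keys, start=1):
        series_ids[key] = series_id
        name_and_tags, _field = key.split("#")
        measurement, *tag_pairs = name_and_tags.split(",")
        postings[measurement].append(series_id)
        for pair in tag_pairs:
            postings[pair].append(series_id)
    return series_ids, dict(postings)


def lookup(postings, *terms):
    """Intersect the postings lists of every term."""
    ids = None
    for term in terms:
        matches = set(postings.get(term, []))
        ids = matches if ids is None else ids & matches
    return sorted(ids or [])


series_ids, postings = build_index([
    "cpu,host=A,region=west#idle",
    "cpu,host=B,region=west#idle",
])
print(lookup(postings, "cpu", "region=west"))
```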
Index V1
• In-memory
• Load on boot
• Memory constrained
• Slower boot times with high cardinality
Index V2
time series meta data
in memory index
on disk index (do we already have?) nope
WAL (an append only file)
on disk indices (periodic flushes)
on disk index (compactions)
Index File Layout
Example Key Exists Lookup
[ 76, 234, 129, 352 ] File locations
cpu,host=serverA,region=west#idle
Robin Hood Hashing
• Can fully load table
• No linked lists for lookup
• Perfect for read-only hashes
Positions: [ 0, 1, 2, 3, 4 ]

Step                                   Keys                Probe Lengths
(empty table)                          [  ,  ,  ,  ,  ]    [ 0, 0, 0, 0, 0 ]
A -> 0                                 [ A,  ,  ,  ,  ]    [ 0, 0, 0, 0, 0 ]
B -> 1                                 [ A, B,  ,  ,  ]    [ 0, 0, 0, 0, 0 ]
C -> 1, then C -> 2 (probe 1)          [ A, B, C,  ,  ]    [ 0, 0, 1, 0, 0 ]
D -> 0, then D -> probe 1 (takes 1)    [ A, D, C,  ,  ]    [ 0, 1, 1, 0, 0 ]
B -> probe 1, then B -> probe 2        [ A, D, C, B,  ]    [ 0, 1, 1, 2, 0 ]
Rob probe rich, give to probe poor
Refinement: average probe
Cache Hit
Keys:          [ A, D, C, B,  ]
Positions:     [ 0, 1, 2, 3, 4 ]
Probe Lengths: [ 0, 1, 1, 2, 0 ]
Average: 1
D -> hashes to 0 + 1
Cache Miss
Keys:          [ A, D, C, B,  ]
Positions:     [ 0, 1, 2, 3, 4 ]
Probe Lengths: [ 0, 1, 1, 2, 0 ]
Z -> hashes to 0
Z -> move probe 1
Z -> move probe 2
Max Probe 2, so Z not present
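The Robin Hood walkthrough can be reproduced with a small sketch. `RobinHoodTable` is a hypothetical class, and the home positions are passed in as a fixed mapping (A -> 0, B -> 1, C -> 1, D -> 0, Z -> 0) to match the slides rather than computed by a real hash function. Lookups simply scan up to the maximum probe length; the average-probe refinement is left out.

```python
class RobinHoodTable:
    """Open-addressing table where a poorer key takes the slot of a richer one."""

    def __init__(self, size, home):
        self.keys = [None] * size
        self.probes = [0] * size
        self.home = home        # function: key -> home position
        self.max_probe = 0

    def insert(self, key):
        # Sketch only: assumes the key is not already present.
        pos, probe = self.home(key) % len(self.keys), 0
        for _ in range(len(self.keys)):
            if self.keys[pos] is None:
                self.keys[pos], self.probes[pos] = key, probe
                self.max_probe = max(self.max_probe, probe)
                return
            if self.probes[pos] < probe:
                # Resident is "probe rich": take its slot and keep placing it.
                self.max_probe = max(self.max_probe, probe)
                key, self.keys[pos] = self.keys[pos], key
                probe, self.probes[pos] = self.probes[pos], probe
            pos, probe = (pos + 1) % len(self.keys), probe + 1
        raise RuntimeError("table is full")

    def contains(self, key):
        # No key sits further than max_probe from its home, so stop there.
        pos = self.home(key) % len(self.keys)
        for _ in range(self.max_probe + 1):
            if self.keys[pos] == key:
                return True
            pos = (pos + 1) % len(self.keys)
        return False


homes = {"A": 0, "B": 1, "C": 1, "D": 0, "Z": 0}
table = RobinHoodTable(5, homes.__getitem__)
for k in "ABCD":
    table.insert(k)
print(table.keys, table.probes)
```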
Cardinality Estimation
HyperLogLog++
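For intuition, here is a sketch of plain HyperLogLog, the algorithm that HyperLogLog++ refines. It omits the ++ additions (sparse representation and empirical bias correction), and the SHA-1 based hash, the precision of 12, and the class name are assumptions chosen for a self-contained example.

```python
import hashlib
import math


class HyperLogLog:
    """Basic HyperLogLog cardinality estimator with 2**precision registers."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, item: str):
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the first 1 bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # constant for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return estimate


hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"cpu,host=server{i},region=west#idle")
print(round(hll.count()))
```

Memory use is fixed by the number of registers, not by the number of series, which is why this kind of estimator suits high-cardinality indexes.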