
Post on 03-Jun-2020


© 2017 InfluxData. All rights reserved. 1

Inside the InfluxDB Storage Engine

Gianluca Arbezzano

gianluca@influxdb.com

@gianarb

What is time series data?

• Stock trades and quotes

• Metrics

• Analytics

• Events

• Sensor data

• Traces

Two kinds of time series data…

Regular time series: samples at regular intervals

Irregular time series: events whenever they come in

Why would you want a database for time series data?

Scale

Example from server monitoring

• 2,000 servers, VMs, containers, or sensor units

• 1,000 measurements per server/unit

• every 10 seconds

• = 17,280,000,000 distinct points per day
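
The daily total follows directly from the three numbers above. A quick check of the arithmetic (the variable names are only for illustration):

```python
servers = 2_000
measurements_per_server = 1_000
interval_seconds = 10
seconds_per_day = 24 * 60 * 60

# Each series produces one point per interval: 86,400 / 10 = 8,640 per day.
samples_per_series_per_day = seconds_per_day // interval_seconds
points_per_day = servers * measurements_per_server * samples_per_series_per_day

print(f"{points_per_day:,}")  # 17,280,000,000
```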

Compression

Aging out data

Downsampling

Fast range queries

Two Databases…

TSDB

Inverted Index

preliminary intro materials…

Everything is indexed by time and series

Shards

Data is organized into shards of time. Each shard is an underlying DB, which makes it efficient to drop old data.

10/10/2015 | 10/11/2015 | 10/12/2015 | 10/13/2015
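
A minimal sketch of the idea, assuming one-day shards as in the dates above. The names `shard_start`, `write`, and `drop_expired` are hypothetical and not InfluxDB's API. Dropping old data becomes deleting whole shards instead of deleting individual points.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
SHARD_DURATION = timedelta(days=1)  # assumption: one shard per day


def shard_start(ts):
    """Return the start time of the shard that owns timestamp ts."""
    return EPOCH + ((ts - EPOCH) // SHARD_DURATION) * SHARD_DURATION


def write(shards, ts, value):
    """Route a point to its shard, creating the shard on first use."""
    shards.setdefault(shard_start(ts), []).append((ts, value))


def drop_expired(shards, now, retention):
    """Delete every shard that ends at or before the retention cutoff."""
    cutoff = now - retention
    for start in [s for s in shards if s + SHARD_DURATION <= cutoff]:
        del shards[start]


shards = {}
for day in (10, 11, 12, 13):
    write(shards, datetime(2015, 10, day, 12, tzinfo=timezone.utc), 1.0)

drop_expired(shards,
             now=datetime(2015, 10, 14, tzinfo=timezone.utc),
             retention=timedelta(days=2))
remaining_days = sorted(s.day for s in shards)
```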

InfluxDB data

temperature,device=dev1,building=b1 internal=80,external=18 1443782126

Measurement: temperature

Tags (tagset all together): device=dev1,building=b1

Fields: internal=80,external=18

Timestamp: 1443782126

We actually store timestamps at up to nanosecond scale, but I couldn’t fit that on the slide

Each series and field maps to a unique ID

temperature,device=dev1,building=b1#internal -> 1

temperature,device=dev1,building=b1#external -> 2

Data per ID is tuples ordered by time

1 (1443782126,80)

2 (1443782126,18)

Arranging in Key/Value Stores

Key (ID,Time)    Value
1,1443782126     80
1,1443782127     81    (new data)
2,1443782126     18
2,1443782130     17
2,1443782256     15
3,1443700126     18

key space is ordered

Many existing storage engines have this model
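
A sketch of why the ordered key space matters: with keys sorted by (ID, time), all points of one series in a time range sit next to each other, so a range query is two binary searches and a slice. `OrderedKV` is a hypothetical in-memory stand-in for such a store, not any real engine.

```python
import bisect


class OrderedKV:
    """Toy ordered key/value store keyed by (series_id, timestamp)."""

    def __init__(self):
        self._keys = []    # kept sorted
        self._values = {}

    def put(self, series_id, ts, value):
        key = (series_id, ts)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, series_id, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._keys, (series_id, start))
        hi = bisect.bisect_right(self._keys, (series_id, end))
        return [(k[1], self._values[k]) for k in self._keys[lo:hi]]


kv = OrderedKV()
for sid, ts, val in [(1, 1443782126, 80), (2, 1443782126, 18),
                     (1, 1443782127, 81), (2, 1443782256, 15),
                     (2, 1443782130, 17), (3, 1443700126, 18)]:
    kv.put(sid, ts, val)

series_2_window = kv.scan(2, 1443782126, 1443782200)
```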

New Storage Engine?!

First we used LSM Trees

deletes expensive

too many open file handles

Then mmap COW B+Trees

write throughput

compression

met our requirements

High write throughput

Awesome read performance

Better Compression

Writes can’t block reads

Reads can’t block writes

Write multiple ranges simultaneously

Hot backups

Many databases open in a single process

Enter InfluxDB’s Time Structured Merge Tree (TSM Tree)

like LSM, but different

Components

WAL

In memory cache

Index Files

Similar to LSM Trees: the WAL is the same, the in memory cache is like MemTables, the index files are like SSTables

awesome time series data

WAL (an append only file)

in memory index

on disk index (periodic flushes)

Memory mapped!
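
A toy model of this write path, under heavy simplification: a list stands in for the append-only WAL file, a dict for the in-memory index, and a flush writes a sorted, immutable snapshot. `TinyEngine` and its flush threshold are hypothetical, and memory mapping is not modeled.

```python
class TinyEngine:
    """Toy write path: WAL append, in-memory index, periodic flush."""

    def __init__(self, flush_threshold=3):
        self.wal = []      # stands in for the append-only file
        self.cache = {}    # in-memory index: key -> {timestamp: value}
        self.files = []    # flushed snapshots, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, ts, value):
        self.wal.append((key, ts, value))           # durability first
        self.cache.setdefault(key, {})[ts] = value  # then make it queryable
        if len(self.wal) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sort by key and by time so the snapshot can be scanned in order.
        snapshot = {key: sorted(points.items())
                    for key, points in sorted(self.cache.items())}
        self.files.append(snapshot)
        self.cache.clear()
        self.wal.clear()


engine = TinyEngine(flush_threshold=3)
engine.write("cpu#idle", 2, 0.7)
engine.write("cpu#idle", 1, 0.9)
engine.write("mem#used", 1, 512.0)   # third write triggers a flush
engine.write("cpu#idle", 3, 0.8)     # stays in the WAL and cache
```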

TSM File

Compression

Timestamps: encoding based on precision and deltas

Timestamps (best case): Run length encoding

Deltas are all the same for a block
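
A sketch of the best case, assuming a simplified in-memory representation instead of the real block format. When every delta in a block is identical, the block collapses to a first timestamp, one delta, and a count. The function names and tuple layout are illustrative only.

```python
def encode_timestamps(timestamps):
    """Delta-encode a non-empty, ordered block; use RLE when deltas match."""
    if not timestamps:
        raise ValueError("empty block")
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if deltas and all(d == deltas[0] for d in deltas):
        # Best case: (first value, shared delta, number of repeats).
        return ("rle", timestamps[0], deltas[0], len(deltas))
    return ("deltas", timestamps[0], deltas)


def decode_timestamps(encoded):
    if encoded[0] == "rle":
        _, first, delta, count = encoded
        return [first + i * delta for i in range(count + 1)]
    _, first, deltas = encoded
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out


regular = [1443782126 + 10 * i for i in range(5)]
irregular = [1443782126, 1443782127, 1443782130, 1443782256]
regular_encoded = encode_timestamps(regular)
irregular_encoded = encode_timestamps(irregular)
```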

Timestamps (good case): Simple8B

Anh and Moffat in "Index compression using 64-bit words"

Timestamps (worst case): raw values

nanosecond timestamps with large deltas

float64: double delta

Facebook’s Gorilla - google: gorilla time series facebook

https://github.com/dgryski/go-tsz

booleans are bits!

int64 uses double delta, zig-zag

zig-zag same as from Protobufs
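
Zig-zag encoding maps signed integers to unsigned ones so that values near zero, positive or negative, become small numbers that pack into few bits. A sketch for 64-bit values, using the same mapping Protobufs uses for its signed varint types (function names are illustrative):

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def zigzag_encode(n):
    """Map a signed 64-bit int to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return ((n << 1) ^ (n >> 63)) & MASK_64


def zigzag_decode(z):
    """Invert zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


encoded = [zigzag_encode(n) for n in (0, -1, 1, -2, 2)]
```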

string uses Snappy

same compression LevelDB uses

(might add dictionary compression)

Updates

Write, resolve at query

Deletes

tombstone, resolve at query & compaction

Compactions

• Combine multiple TSM files

• Put all series points into same file

• Series points in 1k blocks

• Multiple levels

• Full compaction when cold for writes
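
A rough sketch of what a compaction does logically, assuming each file is a dict from series key to (timestamp, value) points and a tombstone is a marker value. The real file and tombstone formats differ, and the 1k blocks and multiple levels are not modeled. Newer files win for the same timestamp, which is also how an update resolves.

```python
TOMBSTONE = object()  # hypothetical delete marker for this sketch


def compact(files):
    """Merge files (oldest first) into one, applying updates and deletes."""
    merged = {}
    for tsm_file in files:
        for key, points in tsm_file.items():
            # Later files overwrite earlier values for the same timestamp.
            merged.setdefault(key, {}).update(points)
    compacted = {}
    for key in sorted(merged):
        live = sorted((ts, value) for ts, value in merged[key].items()
                      if value is not TOMBSTONE)
        if live:
            compacted[key] = live  # all points of a series end up together
    return compacted


older = {"cpu#idle": [(1, 10.0), (2, 20.0), (3, 30.0)]}
newer = {"cpu#idle": [(2, 25.0), (3, TOMBSTONE)], "mem#used": [(1, 5.0)]}
result = compact([older, newer])
```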

Example Query

select percentile(90, value) from cpu
where time > now() - 12h and "region" = 'west'
group by time(10m), host

How to map to series?

Inverted Index!

Inverted Index

series to ID:
cpu,host=A,region=west#idle -> 1
cpu,host=B,region=west#idle -> 2

measurement to fields:
cpu -> [idle]

host to values:
host -> [A, B]

region to values:
region -> [west]

postings lists:
cpu -> [1, 2]
host=A -> [1]
host=B -> [2]
region=west -> [1, 2]
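
A small sketch of how postings lists answer "which series match this query": look up the list for the measurement and for each tag predicate, then intersect them. `build_index` and `lookup` are hypothetical names, and the series key parsing assumes the simple `measurement,tag=value#field` form shown on the slide.

```python
from collections import defaultdict


def build_index(series_keys):
    """Assign IDs starting at 1 and build postings lists per measurement and tag."""
    series_to_id = {}
    postings = defaultdict(set)
    for series_id, key in enumerate(series_keys, start=1):
        series_to_id[key] = series_id
        head, _field = key.split("#")
        measurement, *tag_pairs = head.split(",")
        postings[measurement].add(series_id)
        for pair in tag_pairs:
            postings[pair].add(series_id)
    return series_to_id, postings


def lookup(postings, measurement, **tags):
    """Intersect the measurement's postings list with each tag's list."""
    ids = set(postings.get(measurement, set()))
    for tag_key, tag_value in tags.items():
        ids &= postings.get(f"{tag_key}={tag_value}", set())
    return sorted(ids)


series_to_id, postings = build_index([
    "cpu,host=A,region=west#idle",
    "cpu,host=B,region=west#idle",
])
west_cpu = lookup(postings, "cpu", region="west")
```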


Index V1

• In-memory

• Load on boot

• Memory constrained

• Slower boot times with high cardinality

Index V2

in memory index, on disk index (do we already have?): nope

time series meta data

WAL (an append only file)

on disk indices (periodic flushes)

on disk index (compactions)

Index File Layout


Example Key Exists Lookup

[ 76, 234, 129, 352 ] File locations

cpu,host=serverA,region=west#idle

Robin Hood Hashing

• Can fully load table

• No linked lists for lookup

• Perfect for read-only hashes

Positions: [ 0, 1, 2, 3, 4 ]

A -> 0 (slot 0 is empty)

Keys: [ A, , , , ]
Probe Lengths: [ 0, 0, 0, 0, 0 ]

B -> 1 (slot 1 is empty)

Keys: [ A, B, , , ]
Probe Lengths: [ 0, 0, 0, 0, 0 ]

C -> 1 (slot 1 is taken), C -> probe 1 lands in slot 2

Keys: [ A, B, C, , ]
Probe Lengths: [ 0, 0, 1, 0, 0 ]

D -> 0 (slot 0 is taken), D -> probe 1 takes slot 1 from B, because B's probe length (0) is shorter

Keys: [ A, D, C, , ]
Probe Lengths: [ 0, 1, 1, 0, 0 ]

B -> probe 1 (slot 2 is taken by C), B -> probe 2 lands in slot 3

Keys: [ A, D, C, B, ]
Probe Lengths: [ 0, 1, 1, 2, 0 ]

Rob probe rich, give to probe poor

Refinement: average probe

Keys: [ A, D, C, B, ]
Positions: [ 0, 1, 2, 3, 4 ]
Probe Lengths: [ 0, 1, 1, 2, 0 ]
Average: 1

Cache Hit

D -> hashes to 0 + 1

Cache Miss

Z -> hashes to 0

Z -> move probe 1

Z -> move probe 2

Max Probe 2, so Z not present
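
The walk-through can be sketched in a few lines. This assumes linear probing and uses a fixed `HOME` table as a stand-in for a real hash function so the slots match the slides. The average-probe refinement is not modeled, and lookups simply stop at the maximum probe length.

```python
# Hypothetical stand-in for a hash function: the home slots from the slides.
HOME = {"A": 0, "B": 1, "C": 1, "D": 0, "Z": 0}


def rh_insert(keys, probes, key):
    """Insert with Robin Hood displacement: rob probe rich, give to probe poor."""
    if all(k is not None for k in keys):
        raise ValueError("table is full")
    n = len(keys)
    pos, dist = HOME[key] % n, 0
    while keys[pos] is not None:
        if probes[pos] < dist:
            # The resident is closer to its home slot: swap and carry it on.
            keys[pos], key = key, keys[pos]
            probes[pos], dist = dist, probes[pos]
        pos = (pos + 1) % n
        dist += 1
    keys[pos], probes[pos] = key, dist


def rh_contains(keys, probes, key):
    """A key can never sit further from home than the max probe length."""
    n = len(keys)
    for dist in range(max(probes) + 1):
        if keys[(HOME[key] + dist) % n] == key:
            return True
    return False


keys = [None] * 5
probes = [0] * 5
for k in "ABCD":
    rh_insert(keys, probes, k)
```

After the four inserts the table matches the final slide state, and a lookup for Z gives up after probe 2.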

Cardinality Estimation

HyperLogLog++
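
For intuition, here is a sketch of plain HyperLogLog, not the HyperLogLog++ variant named on the slide (no sparse representation or bias correction tables). It estimates the number of distinct items from a small fixed array of registers. The precision `p=12` and the use of SHA-1 as the hash are assumptions for this example.

```python
import hashlib
import math


def hll_estimate(items, p=12):
    """Estimate the count of distinct strings using 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        index = h >> (64 - p)                # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Rank is the position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)       # small-range correction
    return raw


estimate = hll_estimate(f"cpu,host=server{i}" for i in range(50_000))
```

With 4,096 registers the expected relative error is on the order of 1.6%, so the estimate should land close to 50,000 while using far less memory than a set of all series keys.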

Gianluca Arbezzano

gianluca@influxdb.com

@gianarb

Thank you.