Post on 11-Apr-2017
Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data
Florian Lautenschlager,1 Michael Philippsen,2 Andreas Kumlehn,2 and Josef Adersberger1
1 QAware GmbH, Munich, Germany
2 University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen
Abstract
Anomalies in the runtime behavior of software systems, especially in distributed systems, are inevitable, expensive, and hard to locate. To detect and correct such anomalies one has to automatically collect, store, and analyze the operational data of the runtime behavior, often represented as time series. There are efficient means both to collect and analyze the runtime behavior. But general-purpose time series databases do not focus on the specific needs of anomaly detection. Chronix is a domain specific time series database targeted at anomaly detection in operational data.
Detecting Anomalies in Running Software matters
• Resource consumption: anomalous memory consumption, high CPU usage, …
• Sporadic failure: blocking state, deadlock, dirty read, …
• Security: port scanning activity, short frequent login attempts, …
→ Economic or reputation loss.
Anomaly Detection Tool Chain for Operational Data
Collection Framework: Collects operational data from a running application
Time Series Database: Stores the time series data
Analysis Framework: Asks the database for data and analyzes the data
General Purpose TSDB
• Brake shoe
• Resource hog
• Productivity obstacle
Chronix: Domain specific TSDB
Domain specific sensors and adaptors
Domain specific analysis algorithms and tools
Application’s Operational Data
Types of operational data:
• Metrics: scalar values, e.g., rates, runtimes, total hits, counters, …
• Events: single occurrences, e.g., a user’s login, product order, …
• Traces: sequences within a software system, e.g., the called methods, …
General-purpose TSDBs in Anomaly Detection
Requirements Graphite InfluxDB OpenTSDB KairosDB Prometheus
Generic data model ○ ◐ ○ ◐ ○
Analysis support ◐ ◐ ○ ◐ ◐
Lossless long-term storage ○ ◐
• No support for data types = Productivity obstacle
• No support for analyses = Productivity obstacle + Brake shoe
• High memory footprint = Performance hog
• High storage demands = Performance hog
• Loss of historical data = Brake shoe
What makes Chronix domain specific?
Option to pre-compute an extra representation of the data
Optional timestamp compression for almost-periodic time series
Records that meet the needs of the domain
Compression technique that suits the domain’s data
Underlying multi-dimensional storage
Domain specific query language with server-side evaluation
Domain specific commissioning of configuration parameters
How it works!
Example: Almost-periodic time series
Timestamp Value Metric Process Host
25.10.2016 00:00:01.546 218.34 ingester\time SmartHub QAMUC
25.10.2016 00:00:06.718 218.37 ingester\time SmartHub QAMUC
25.10.2016 00:00:11.891 218.49 ingester\time SmartHub QAMUC
25.10.2016 00:00:16.964 218.35 ingester\time SmartHub QAMUC
… … … … …
Optional Pre-compute Extras
Timestamp Value Metric Process Host SAX
25.10.2016 00:00:01.546 218.34 ingester\time SmartHub QAMUC A
25.10.2016 00:00:06.718 218.37 ingester\time SmartHub QAMUC B
25.10.2016 00:00:11.891 218.49 ingester\time SmartHub QAMUC C
25.10.2016 00:00:16.964 218.35 ingester\time SmartHub QAMUC B
… … … … … …
• Lossless storage that keeps all details as analyses may need them.
• Programming interface to add extra domain specific "columns".
• These "columns" speed up anomaly detection queries.
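How such an extra "column" could be pre-computed can be sketched as follows. This is a minimal illustration of Symbolic Aggregate approXimation (SAX) in Python, not the Chronix implementation: the function name `sax_symbols`, the four-letter alphabet, and the segment length of one point per symbol are assumptions made for the example.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_symbols(values, alphabet="ABCD", segment=1):
    """Return one SAX letter per run of `segment` consecutive values.

    Hypothetical helper for illustration; alphabet and segment length
    are assumptions, not Chronix defaults.
    """
    mu, sigma = mean(values), pstdev(values)
    # z-normalize; a constant series maps to all zeros
    z = [(v - mu) / sigma if sigma else 0.0 for v in values]
    # piecewise aggregate approximation: mean of each segment
    paa = [mean(z[i:i + segment]) for i in range(0, len(z), segment)]
    # breakpoints that cut the standard normal into equiprobable regions
    k = len(alphabet)
    cuts = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    return [alphabet[bisect_right(cuts, p)] for p in paa]


# the four values of the example table above
values = [218.34, 218.37, 218.49, 218.35]
symbols = sax_symbols(values)
```

The letters depend on the chosen alphabet size and on the normalization window, so they need not coincide with the symbols shown in the table above. Storing them next to the raw values lets a pattern query compare short strings instead of re-reading every data point.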
Optional Timestamp Compaction
Timestamp
25.10 … :01.546
25.10 … :06.718
25.10 … :11.891
25.10 … :16.964
…
→ Calculate deltas
Timestamp
25.10 … :01.546
5.172
5.173
5.073
…
→ Compute diffs between them
Timestamp
25.10 … :01.546
5.172
0.001
0.1
…
→ Drop diffs below threshold
Timestamp
25.10 … :01.546
5.172
-
-
…
→ If accumulated drift > threshold store delta
Timestamp
25.10 … :01.546
5.172
…
→ space saved
• Date-Delta-Compaction for almost-periodic time series.
• Functionally lossless as all relevant details are kept.
• Degree of inaccuracy is a configuration parameter of Chronix.
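The compaction steps can be sketched in a few lines of Python. This is an illustrative reading of the poster, not the Chronix code: the names `ddc_encode` and `ddc_decode` are hypothetical, timestamps are assumed to be integer milliseconds, and the choice to store a correcting delta once the drift exceeds the threshold is an assumption.

```python
def ddc_encode(timestamps, threshold):
    """Return (first_timestamp, deltas); an elided delta is None.

    Hypothetical sketch: a delta is only stored when the timestamp a
    decoder would reconstruct drifts more than `threshold` from the
    real one.
    """
    first = timestamps[0]
    deltas = []
    last_delta = None       # most recently stored delta
    reconstructed = first   # what a decoder has computed so far
    for cur in timestamps[1:]:
        if last_delta is not None and abs(reconstructed + last_delta - cur) <= threshold:
            deltas.append(None)          # drift is tolerable: store nothing
            reconstructed += last_delta
        else:
            last_delta = cur - reconstructed   # assumption: store a correcting delta
            deltas.append(last_delta)
            reconstructed = cur
    return first, deltas


def ddc_decode(first, deltas):
    """Rebuild timestamps; an elided delta repeats the last stored one."""
    out = [first]
    last_delta = None
    for d in deltas:
        if d is not None:
            last_delta = d
        out.append(out[-1] + last_delta)
    return out


# milliseconds within the minute, taken from the example table
ts = [1546, 6718, 11891, 16964]
encoded = ddc_encode(ts, threshold=200)
decoded = ddc_decode(*encoded)
```

With a threshold of 200 ms only the first delta of 5172 ms is stored, which matches the last column of the figure. With a threshold of 0 every delta is kept and the reconstruction is exact.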
Domain Specific Records
Process Host SAX
SmartHub QAMUC A
SmartHub QAMUC B
SmartHub QAMUC C
SmartHub QAMUC B
… … …
Record
metric: ingester\time
process: SmartHub
host: QAMUC
start: 25.10.2016 00:00:01.546
end: …
type: metric
data: Timestamp Value SAX
25.10.2016 00:00:01.546 218.34 A
5.172 218.37 B
- 218.49 C
- 218.35 B
BLOB
chunk & convert
Timestamp Value Metric
25….:01.546 218.34 ingester\time
5.172 218.37 ingester\time
- 218.49 ingester\time
- 218.35 ingester\time
… … …
• Exploit repetitiveness and bundle "lines" into data chunks.
• Programming interface for a specific time series record encoding.
• Chunk size is a configuration parameter of Chronix.
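The "chunk & convert" step can be sketched as follows. The `Record` class and `to_records` function are hypothetical names for this illustration, and the chunk size is expressed as a number of points rather than in kBytes to keep the example short.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Record:
    # attribute names follow the record shown in the figure
    metric: str
    process: str
    host: str
    start: int
    end: int
    type: str
    data: list  # (timestamp, value) pairs of this chunk


def to_records(lines, chunk_points=2):
    """Bundle lines with identical attributes into fixed-size chunks.

    Assumption: chunk size counted in points, not in kBytes.
    """
    def key(line):
        return (line["metric"], line["process"], line["host"])

    records = []
    for (metric, process, host), group in groupby(sorted(lines, key=key), key=key):
        points = sorted((l["timestamp"], l["value"]) for l in group)
        for i in range(0, len(points), chunk_points):
            chunk = points[i:i + chunk_points]
            records.append(Record(metric, process, host,
                                  chunk[0][0], chunk[-1][0], "metric", chunk))
    return records


attrs = {"metric": "ingester\\time", "process": "SmartHub", "host": "QAMUC"}
lines = [dict(attrs, timestamp=t, value=v)
         for t, v in [(1546, 218.34), (6718, 218.37), (11891, 218.49), (16964, 218.35)]]
records = to_records(lines, chunk_points=2)
```

The repeated attributes are stored once per record instead of once per line, and `start` and `end` allow a time range query to skip records without opening their data.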
Domain Specific Compression
Record
metric: ingester\time
process: SmartHub
host: QAMUC
start: 25.10.2016 00:00:01.546
end: …
type: metric
data: 00105e0 e6b0 343b 9
07bc 0804 e7d508040
Record
metric: ingester\time
process: SmartHub
host: QAMUC
start: 25.10.2016 00:00:01.546
end: …
type: metric
data: Timestamp Value SAX
25.10.2016 00:00:01.546 218.34 A
5.172 218.37 B
- 218.49 C
- 218.35 B
serialize & compress → Compressed BLOB
• Lossless compression techniques minimize the record size.
• Domain data often has small increments, recurring patterns, etc.
• Choice of compression technique is a configuration parameter of Chronix.
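The "serialize & compress" step can be illustrated with the Python standard library. JSON stands in for the record encoding here, which is an assumption made for brevity, and gzip is used because it is one of the codecs compared on the poster. The function names are hypothetical.

```python
import gzip
import json


def serialize(points):
    """Encode (delta_ms, value) pairs as compact JSON bytes (stand-in encoding)."""
    return json.dumps(points, separators=(",", ":")).encode("utf-8")


def compress_data(points):
    """Serialize and gzip the data part of a record."""
    return gzip.compress(serialize(points))


def decompress_data(blob):
    """Invert compress_data; JSON arrays are turned back into tuples."""
    return [tuple(p) for p in json.loads(gzip.decompress(blob))]


# synthetic almost-periodic data: constant delta, values recurring in a short pattern
points = [(5172, round(218.30 + (i % 4) * 0.01, 2)) for i in range(1000)]
blob = compress_data(points)
```

Because the deltas repeat and the values move in small increments, the compressed BLOB is much smaller than the serialized input, and decompression returns exactly the original points.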
Multi-Dimensional Storage
Timestamp Value Metric Process Host
25.10.2016 00:00:01.546 218.34 ingester\time SmartHub QAMUC
25.10.2016 00:00:06.718 218.37 ingester\time SmartHub QAMUC
25.10.2016 00:00:11.891 218.49 ingester\time SmartHub QAMUC
25.10.2016 00:00:16.964 218.35 ingester\time SmartHub QAMUC
… … … … …
q=host:QAMUC AND metric:ingester* AND type:[metric OR trace] AND end:NOW-7MONTH
• Explorative: Users can use the attributes to find a record.
• Correlating: Queries can use and combine all types.
Query Language & Server-Side eval.
               Graphite  InfluxDB  OpenTSDB  KairosDB  Prometheus  Chronix
Basic
distinct       ×         ✓         ×         ×         ×           ✓
integral       ✓         ×         ×         ×         ×           ✓
min/max/sum    ✓         ✓         ✓         ✓         ✓           ✓
bottom/top     ×         ✓         ×         ×         ✓           ✓
first/last     ✓         ✓         ×         ✓         ×           ✓
…              …         …         …         …         …           …
nnderivative   ✓         ✓         ×         ×         ×           ✓
movavg         ✓         ✓         ×         ×         ×           ✓
divide/scale   ✓         ✓         ×         ✓         ✓           ✓
High-level
sax [33]       ×         ×         ×         ×         ×           ✓
fastdtw [38]   ×         ×         ×         ×         ×           ✓
outlier        ×         ×         ×         ×         ×           ✓
trend          ×         ×         ×         ×         ×           ✓
frequency      ×         ×         ×         ×         ×           ✓
grpsize        ×         ×         ×         ×         ×           ✓
split          ×         ×         ×         ×         ×           ✓
query, read
result, process
q=metric:ingester* & cf=outlier
• Basic functions & high-level built-in domain specific functions.
• Plug-in interface to add functions for server-side evaluation.
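The idea of pluggable functions that run next to the data can be sketched as a small registry. Everything in this example is hypothetical: `register_function`, `FUNCTIONS`, and `evaluate` are illustration names, not the Chronix plug-in interface, and the interquartile-range rule used for `outlier` is an assumed definition.

```python
from statistics import quantiles

FUNCTIONS = {}  # hypothetical registry: function name -> callable


def register_function(name):
    """Decorator that makes a function available under `cf=<name>`."""
    def register(fn):
        FUNCTIONS[name] = fn
        return fn
    return register


@register_function("outlier")
def has_outlier(values):
    # assumption: a value above Q3 + 1.5 * IQR counts as an outlier
    q1, _, q3 = quantiles(values, n=4)
    upper = q3 + 1.5 * (q3 - q1)
    return any(v > upper for v in values)


def evaluate(records, metric_prefix, cf):
    """Server-side evaluation: return only the metrics whose data match."""
    fn = FUNCTIONS[cf]
    return [r["metric"] for r in records
            if r["metric"].startswith(metric_prefix) and fn(r["values"])]


records = [
    {"metric": "ingester\\time", "values": [1, 2, 3, 4, 5, 6, 7, 8, 100]},
    {"metric": "ingester\\count", "values": [1, 2, 3, 4, 5, 6, 7, 8, 9]},
    {"metric": "other\\time", "values": [1, 2, 3, 4, 5, 6, 7, 8, 100]},
]
hits = evaluate(records, "ingester", "outlier")
```

Only the names of the matching time series travel back to the client, which is the point of evaluating `q=metric:ingester* & cf=outlier` on the server instead of shipping all raw data.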
Domain Specific Commissioning
[Plot: Inaccuracy Rate, Average Deviation, and Space Reduction (y-axis: rates in %, average in ms, 0 to 90) over the DDC threshold in ms (x-axis: 0, 5, 10, 50, 100, 200, 1000).]
[Plot: Total access time in sec (y-axis: 32 to 48) over the chunk size in kBytes (x-axis: 32, 64, 128, 256, 512, 1024) for gzip, LZ4, and Snappy.]
• DDC-Threshold: 200 ms.
• Compression & Chunk Size: gzip + 128 kByte.
Easily detect pattern!
Fast!
Best Values!
Select the ideal Compression!
Projects 1–3
Remove Jitter!
Evaluation
Benchmark Client: Ubuntu 16.04.1 x64, 12 core, 32 GB RAM, 512 GB SSD
Java Benchmark
Benchmark Server: Ubuntu 16.04.1 x64, 12 core, 32 GB RAM, 512 GB SSD
Docker: InfluxDB, KairosDB, Chronix, OpenTSDB
Queries and Time Series via HTTP
Data of 5 Industry Projects.
Project                  1        2        3       4         5     total
time series          1,080    8,567    4,538     500    24,055    38,740
pairs (mio) metric     2.4    331.4    162.6     3.9   3,762.3   4,262.6
pairs (mio) lsof       0.0      0.0      0.0     0.4       0.0       0.4
pairs (mio) strace     0.0      0.0      0.0    12.1       0.0      12.1
(a) Pairs and time series per project.

Project  r    0.5    1    7   14   21   28   56   91  180  total
1 – 3    q     15   30   30   10    5    3    1    2    0     96
4 & 5    q      2   11   15    8   12    5    1    2    2     58
4 & 5    b      1    6    5    7    2    4    4    1    2     32
4 & 5    h      2    6   10    8    6    6    3    2    0     43
(b) Time ranges r (days) and occurrences of queries (q) for raw data retrieval, and of queries with basic (b) and high-level (h) functions.
Memory footprint (in MBytes)
               InfluxDB  OpenTSDB  KairosDB  Chronix
Initially            33     2,726     8,763      446
Import (max)     10,336    10,111    18,905    7,002
Query (max)       8,269     9,712    11,230    4,792
Chronix has a 34%–69% smaller memory footprint.
Storage demands (in GBytes)
Project  Raw Data  InfluxDB  OpenTSDB  KairosDB  Chronix
4             1.2       0.2       0.2       0.3      0.1
5           107.0      10.7      16.9      26.5      8.6
total       108.2      10.9      17.1      26.8      8.7
Chronix saves 20%–68% of the storage space.
Data retrieval times (in s)
r       q   InfluxDB  OpenTSDB  KairosDB  Chronix
0.5     2        4.3       2.8       4.4      0.9
1      11        5.2       5.6       6.6      5.3
7      15       34.1      17.4      26.8      7.0
14      8       36.2      14.2      25.5      4.0
21     12       76.5      29.8      55.0      6.0
28      5        7.9       3.9       5.6      0.5
56      1       35.4      12.4      24.1      1.2
91      2       47.5      15.5      33.8      1.1
180     2       96.7      36.7      66.6      1.1
total          343.8     138.3     248.4     27.1
Chronix saves 80%–92% on data retrieval time.
Times for b- and h-queries (in s)
q   Basic (b)       InfluxDB  OpenTSDB  KairosDB  Chronix
4   Avg.                 0.9       6.1       9.8      4.4
5   Max.                 1.3       8.4       9.1      6.0
3   Min.                 0.7       2.7       5.3      2.8
3   Dev.                 6.7      16.7      21.1      2.3
5   Sum                  0.7       6.0      12.0      2.0
4   Count                0.8       5.5      10.5      1.0
8   Perc.               10.2      25.8      34.5      8.6
    High-level (h)
12  Outlier             30.7      29.1     117.6     18.9
14  Trend              162.7      50.4     100.6     30.2
11  Freq.               47.3      23.9      45.7     16.3
3   GroupSize          218.9    2927.8     206.3     29.6
3   Split              123.1    2893.9      47.9     37.2
75  total              604.0    5996.3     620.4    159.3
Chronix saves 73%–97% on analysis times.
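The stated savings for storage, data retrieval, and analysis can be recomputed from the "total" rows of the tables above. The short Python sketch below does this arithmetic; the dictionary layout and the function name `savings` are only illustrative.

```python
# "total" rows copied from the storage, retrieval, and analysis tables
totals = {
    "storage (GBytes)": {"InfluxDB": 10.9, "OpenTSDB": 17.1, "KairosDB": 26.8, "Chronix": 8.7},
    "retrieval (s)": {"InfluxDB": 343.8, "OpenTSDB": 138.3, "KairosDB": 248.4, "Chronix": 27.1},
    "analysis (s)": {"InfluxDB": 604.0, "OpenTSDB": 5996.3, "KairosDB": 620.4, "Chronix": 159.3},
}


def savings(row):
    """Percentage saved by Chronix relative to each other database."""
    chronix = row["Chronix"]
    return {db: 100 * (1 - chronix / value)
            for db, value in row.items() if db != "Chronix"}


ranges = {}
for name, row in totals.items():
    saved = savings(row)
    ranges[name] = (min(saved.values()), max(saved.values()))
    print(f"{name}: {ranges[name][0]:.1f}% to {ranges[name][1]:.1f}%")
```

Each savings figure is one minus the ratio of the Chronix total to the total of the other database, so the smallest and largest value per table give the range quoted on the poster.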
Typical scenario.
Important!
Needed for exploration!
Evaluation: Projects 4 & 5
Conclusion
Chronix exploits the characteristics of the domain in many ways and thus achieves better storage and query results. Chronix is open source. www.chronix.io
Acknowledgements
This research was in part supported by the Bavarian Ministry of Economic Affairs and Media, Energy and Technology as an IuK-grant for the project DfD – Design for Diagnosability.