Online Metrics Prediction in Monitoring Systems
Matthieu Caneill, Noel De Palma, Ali Ait-Bachir, Bastien Dine, Rachid Mokhtari, Yagmur Gizem Cinar
April 16, 2018
IEEE INFOCOM 2018, The 8th International Workshop on Big Data in Cloud
Performance (DCPerf'18)
Table of contents
1. Introduction
2. Metrics prediction
3. Evaluation
4. Conclusion
2 / 22
Introduction
System goal: anticipate failures
(diagram: monitoring data → machine learning system)
- Monitoring insights
- Failure prediction
- Infrastructure scaling
- More server uptime
3 / 22
Introduction
Desired properties
- Scalable infrastructure: at least up to a few servers (around 150 CPU cores)
- End-to-end fault tolerance: metrics can never be lost
- Performance: "fast" to compute metric predictions (low latency)
4 / 22
Introduction
Metrics
- Monitoring metric: observation point on a server in a datacenter
- System metrics: CPU load, memory, open sockets
- Service metrics: database status, web server response time
- Reported by agents, processed, and stored
- Computed as time series
- Associated with thresholds: warning and critical
5 / 22
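The metric model above (a named time series carrying a warning and a critical threshold) can be sketched in a few lines. This is a hypothetical illustration, not the system's actual schema; all names (`Metric`, `status`, the threshold values) are invented for the example.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a monitored metric as described on the slide:
# a named time series with warning and critical thresholds.
@dataclass
class Metric:
    name: str
    warning: float
    critical: float
    points: list = field(default_factory=list)  # (timestamp, value) pairs

    def add(self, ts, value):
        self.points.append((ts, value))

    def status(self):
        """Classify the latest observation against the thresholds."""
        if not self.points:
            return "unknown"
        _, value = self.points[-1]
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return "ok"

disk = Metric("disk_used_gb", warning=694.0, critical=700.0)
disk.add(0, 688.0)
disk.add(1, 695.5)
print(disk.status())  # → warning
```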
Introduction
Metrics
I Monitoring metric: observation point on a server in adatacenter
I System metrics: CPU load, memory, open sockets
I Service metrics: database status, web server response time
I Reported by agents, processed, and stored
I Computed as time-series
I Associated to thresholds: warning and critical
5 / 22
Introduction
Metrics
I Monitoring metric: observation point on a server in adatacenter
I System metrics: CPU load, memory, open sockets
I Service metrics: database status, web server response time
I Reported by agents, processed, and stored
I Computed as time-series
I Associated to thresholds: warning and critical
5 / 22
Introduction
Metrics
I Monitoring metric: observation point on a server in adatacenter
I System metrics: CPU load, memory, open sockets
I Service metrics: database status, web server response time
I Reported by agents, processed, and stored
I Computed as time-series
I Associated to thresholds: warning and critical
5 / 22
Introduction
Metrics
I Monitoring metric: observation point on a server in adatacenter
I System metrics: CPU load, memory, open sockets
I Service metrics: database status, web server response time
I Reported by agents, processed, and stored
I Computed as time-series
I Associated to thresholds: warning and critical
5 / 22
Introduction
Metrics
I Monitoring metric: observation point on a server in adatacenter
I System metrics: CPU load, memory, open sockets
I Service metrics: database status, web server response time
I Reported by agents, processed, and stored
I Computed as time-series
I Associated to thresholds: warning and critical
5 / 22
Metrics prediction
Metrics behaviour: 6 scenarios
(figure: metric value over time, crossing warning and critical zones under six behaviours — quick rise, slow rise, transient rise, perplexity point, slow rise, quick rise)
6 / 22
Metrics prediction
Linear regression
(figure: fitted line over a metric's value/time plot)
- Ability to identify local trends (few hours)
- Fast to compute
- Good candidate to avoid false positives (peaks)
- Library: MLlib (part of Apache Spark)
7 / 22
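The per-metric fit can be sketched as ordinary least squares on (time, value) pairs. The system uses Spark's MLlib for this; the standalone plain-Python version below only illustrates the computation, and the sample data is invented.

```python
# Minimal sketch of the per-metric linear regression: ordinary least
# squares on (time, value) pairs. The deck's system delegates this to
# MLlib; this plain-Python version just shows the fit itself.
def fit_line(ts, ys):
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    var = sum((t - mean_t) ** 2 for t in ts)
    slope = cov / var
    intercept = mean_y - slope * mean_t
    return slope, intercept

def predict(slope, intercept, t):
    return slope * t + intercept

# Local trend over the last few hours of a (made-up) metric
hours = [0, 1, 2, 3, 4]
values = [688.0, 688.9, 690.1, 690.8, 692.2]
slope, intercept = fit_line(hours, values)
print(round(predict(slope, intercept, 10), 1))  # extrapolate 10 h ahead
```

Fitting only the last few hours is what keeps the model sensitive to local trends while staying cheap to recompute.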
Metrics prediction
System architecture
(diagram: monitoring agents → monitoring broker → Cassandra database → Spark + MLlib → alert manager, GUI, ...)
8 / 22
Metrics prediction
Metric blacklisting
- Some metrics are too volatile and hard to predict
- To avoid false positives/negatives, and to save resources, they are blacklisted
- Root Mean Square Error evaluated weekly
- Metrics (temporarily) blacklisted if their RMSE > threshold
- 58.5% of the metrics have a low RMSE → good predictions
9 / 22
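The weekly blacklisting pass can be sketched as: compare last week's predictions against the observed values, compute the RMSE, and blacklist the metric when it exceeds a threshold. The threshold value and sample data below are illustrative, not the ones used by the system.

```python
import math

# Sketch of the weekly blacklisting decision described on the slide.
def rmse(predicted, observed):
    """Root Mean Square Error between predictions and observations."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def should_blacklist(predicted, observed, threshold=5.0):
    # threshold is an illustrative value, not the paper's
    return rmse(predicted, observed) > threshold

stable   = ([2.3, 2.4, 2.5], [2.35, 2.38, 2.52])   # predictable metric
volatile = ([2.3, 2.4, 2.5], [9.0, 0.1, 14.2])      # erratic metric
print(should_blacklist(*stable))    # → False
print(should_blacklist(*volatile))  # → True
```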
Metrics prediction
Example
(figure: metric value over 10 hours, around 2.3–2.7 — past, predicted, and future series, marking the point where the prediction is computed)
Figure: swap memory
10 / 22
Metrics prediction
Example
(figure: metric value over 10 hours, around 2.0–2.4 — past, predicted, and future series)
Figure: physical memory
11 / 22
Metrics prediction
Example
(figure: metric value over 20 hours, around 688–696 — past, predicted, and future series crossing the warning threshold; annotation: "raise alert: disk full in 10 hours")
Figure: disk partition
12 / 22
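The alerting step in the disk-partition example can be sketched as: given the fitted trend, estimate when the metric will cross the warning threshold and raise an alert if that falls within the prediction horizon. The function name, slope/intercept values, and 24-hour horizon below are illustrative assumptions.

```python
# Sketch of the threshold-crossing alert shown in the disk example:
# extrapolate the fitted line value = slope * t + intercept and find
# when it reaches the warning threshold.
def hours_until_threshold(slope, intercept, now, threshold):
    """Hours until the fitted line reaches threshold, or None if never."""
    if slope <= 0:
        return None  # flat or decreasing trend never crosses
    t_cross = (threshold - intercept) / slope
    return t_cross - now if t_cross > now else 0.0

# Illustrative numbers: disk rising ~0.8 GB/h, currently at hour 20,
# warning threshold at 696
slope, intercept = 0.8, 672.0
eta = hours_until_threshold(slope, intercept, now=20, threshold=696.0)
if eta is not None and eta <= 24:  # assumed 24 h alerting horizon
    print(f"raise alert: threshold reached in {eta:.0f} hours")
```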
Evaluation
Setup
- Hardware: 4 servers (16–28 cores, 128–256 GB RAM)
- Dataset: replay of production data recorded at Coservit
- 424 206 metrics, 1.5 billion data points monitored on 25 070 servers
13 / 22
Evaluation
CPU load and memory consumption
(figure, two panels: CPU and memory usage, 0–120%, over 800 seconds, on (a) master and (b) slave-1)
Figure: Running on 4 machines and 100 cores for 15 minutes.
14 / 22
Evaluation
Time repartition
(figure: time in ms, 0–500, spent in each stage — load, create dataframe, train, predict, save, publish)
Figure: Time repartition for predicting a metric.
15 / 22
Evaluation
Load handling
- End-to-end process for the prediction of 1 metric: 1 second.
- One monitoring server (with 24 cores) can handle the load of 1440 metrics (at worst), i.e. 85 monitored servers on average.
16 / 22
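A back-of-the-envelope check reproduces the slide's figures, assuming each metric is recomputed once per minute (the cadence is my assumption, not stated on the slide): 24 cores × 60 seconds / 1 second per prediction = 1440 metrics.

```python
# Capacity arithmetic behind the load-handling numbers above.
seconds_per_prediction = 1   # end-to-end cost of one prediction
cores = 24                   # one monitoring server
cycle_seconds = 60           # ASSUMED recomputation period per metric
metrics_capacity = cores * cycle_seconds // seconds_per_prediction
print(metrics_capacity)      # → 1440

# 1440 metrics spread over 85 monitored servers ≈ 17 metrics per server
print(round(metrics_capacity / 85))  # → 17
```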
Evaluation
Load handling: linear scaling
(figure: processed metrics (×1000, 0–120) vs CPU cores (0–140), for 1, 2, and 3 slaves — scaling is linear)
Figure: Amount of metrics handled in 15 minutes.
17 / 22
Conclusion
Related work
Positioning
No published work exhibits the same system (end-to-end system for monitoring metrics prediction, storage, and blacklisting).
Prediction models
- Hardware failures [CAS12]
- Capacity planning (e.g. Microsoft Azure [mic])
- Datacenter temperature (e.g. Thermocast [LLL+11])
- Monitoring metrics (e.g. Zabbix [zab] with manual tuning)
18 / 22
Conclusion
Future work
- Experiment with more complex ML algorithms
- Predictions on long-term global trends
- Link with ticketing mechanism
19 / 22
Thanks! Questions?
Bibliography I
[CAS12] T. Chalermarrewong, T. Achalakul, and S. C. W. See. Failure prediction of data centers using time series and fault tree analysis. In 2012 IEEE 18th International Conference on Parallel and Distributed Systems, pages 794–799, Dec 2012.
[LLL+11] Lei Li, Chieh-Jan Mike Liang, Jie Liu, Suman Nath, Andreas Terzis, and Christos Faloutsos. Thermocast: A cyber-physical forecasting model for datacenters. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 1370–1378, New York, NY, USA, 2011. ACM.
21 / 22
Bibliography II
[mic] Microsoft cloud Azure. https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice.
[zab] Zabbix prediction triggers. https://www.zabbix.com/documentation/3.0/manual/config/triggers/prediction.
22 / 22