

Online Metrics Prediction in Monitoring Systems

Matthieu Caneill, Noel De Palma, Ali Ait-Bachir, Bastien Dine, Rachid Mokhtari, Yagmur Gizem Cinar

April 16, 2018

IEEE INFOCOM 2018, The 8th International Workshop on Big Data and Cloud Performance (DCPerf'18)

Table of contents

1. Introduction

2. Metrics prediction

3. Evaluation

4. Conclusion


Introduction

System goal: anticipate failures

[Figure: monitoring data flows into a machine learning system.]

- Monitoring insights
- Failure prediction
- Infrastructure scaling
- More server uptime

Introduction

Desired properties

- Scalable infrastructure: at least up to a few servers (around 150 CPU cores)
- End-to-end fault tolerance: metrics can never be lost
- Performance: "fast" computation of metric predictions (low latency)

Introduction

Metrics

- Monitoring metric: an observation point on a server in a datacenter
- System metrics: CPU load, memory, open sockets
- Service metrics: database status, web server response time
- Reported by agents, processed, and stored
- Computed as time series
- Associated with thresholds: warning and critical (an illustrative data shape follows)
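To make the metric model concrete, here is a minimal Python sketch of one metric, its warning and critical thresholds, and its time series of samples. The class and field names are hypothetical, not taken from the paper.

    # Hypothetical shape for a monitored metric; names are illustrative only.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Metric:
        name: str        # e.g. "cpu_load" or "web_response_time"
        host: str        # server reporting the metric
        warning: float   # warning threshold
        critical: float  # critical threshold
        # Time series of (timestamp, value) samples reported by the agent.
        samples: List[Tuple[float, float]] = field(default_factory=list)

        def zone(self, value: float) -> str:
            """Classify a value against the warning/critical thresholds."""
            if value >= self.critical:
                return "critical"
            if value >= self.warning:
                return "warning"
            return "ok"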

Metrics prediction

Metrics behaviour: 6 scenarios

[Figure: metric value over time against the warning and critical zones, illustrating six scenarios: quick rise, slow rise, transient rise, perplexity point, slow rise, and quick rise.]

Metrics prediction

Linear regression

[Figure: metric value over time, with a fitted regression line.]

- Ability to identify local trends (a few hours)
- Fast to compute
- Good candidate to avoid false positives (peaks)
- Library: MLlib (part of Apache Spark); a minimal fitting sketch follows
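As a concrete illustration of the approach, here is a minimal sketch (not the authors' code) of fitting a local trend to one metric's recent samples with MLlib's LinearRegression; the sample values are made up.

    # Minimal sketch: fit value(t) = slope * t + intercept with Spark MLlib.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("metric-trend").getOrCreate()

    # One (timestamp, value) pair per observation of a single metric.
    samples = [(0.0, 2.31), (1.0, 2.35), (2.0, 2.40), (3.0, 2.44),
               (4.0, 2.51), (5.0, 2.55), (6.0, 2.58), (7.0, 2.63)]
    df = spark.createDataFrame(samples, ["t", "value"])

    # MLlib regressors expect a vector column of features; here the only
    # feature is time.
    assembler = VectorAssembler(inputCols=["t"], outputCol="features")
    model = LinearRegression(featuresCol="features", labelCol="value") \
        .fit(assembler.transform(df))

    slope, intercept = model.coefficients[0], model.intercept
    print(f"trend: value(t) = {slope:.4f} * t + {intercept:.4f}")

Because the fit is over a short window (a few hours), a single transient peak barely moves the slope, which is what makes the method a good candidate against false positives.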

Metrics prediction

System architecture

[Figure: monitoring agents send metrics to a monitoring broker; samples are stored in a Cassandra database and processed by Spark + MLlib; predictions feed an alert manager, a GUI, ...]
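The slides do not name the broker or show code for this path; purely as an illustration, the sketch below wires a Kafka topic (standing in for the monitoring broker) to Cassandra using the kafka-python and cassandra-driver libraries. Topic, keyspace, and table names are all hypothetical.

    # Hypothetical glue for the ingestion path (broker -> Cassandra).
    # Kafka stands in for the unnamed monitoring broker; names are made up.
    import json
    from kafka import KafkaConsumer          # pip install kafka-python
    from cassandra.cluster import Cluster    # pip install cassandra-driver

    consumer = KafkaConsumer(
        "metrics",                           # hypothetical topic
        bootstrap_servers="broker:9092",
        value_deserializer=lambda b: json.loads(b))

    session = Cluster(["cassandra-host"]).connect("monitoring")
    insert = session.prepare(
        "INSERT INTO samples (metric_id, ts, value) VALUES (?, ?, ?)")

    # Each message carries one (metric, timestamp, value) sample.
    for msg in consumer:
        m = msg.value
        session.execute(insert, (m["metric_id"], m["ts"], m["value"]))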

Metrics prediction

Metric blacklisting

- Some metrics are too volatile and hard to predict
- To avoid false positives/negatives, and to save resources, they are blacklisted
- Root Mean Square Error (RMSE) evaluated weekly
- Metrics are (temporarily) blacklisted if their RMSE > threshold (a sketch of this check follows)
- 58.5% of the metrics have a low RMSE → good predictions
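A minimal sketch (assumed, not the authors' code) of the weekly blacklisting pass: compare each metric's past predictions against the values actually observed, and blacklist the metrics whose RMSE exceeds a threshold. The threshold value below is illustrative; the slides do not publish it.

    import math

    RMSE_THRESHOLD = 5.0  # illustrative; not from the paper

    def rmse(predicted, observed):
        """Root Mean Square Error between two equally long series."""
        assert len(predicted) == len(observed) and predicted
        return math.sqrt(
            sum((p - o) ** 2 for p, o in zip(predicted, observed))
            / len(predicted))

    def weekly_blacklist(history):
        """history: {metric_id: (predicted_series, observed_series)}."""
        return {m for m, (pred, obs) in history.items()
                if rmse(pred, obs) > RMSE_THRESHOLD}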

Metrics prediction

Example

[Figure: swap memory metric over 10 hours; past values up to the point where the prediction is computed, the predicted values, and the actual future values.]

Metrics prediction

Example

[Figure: physical memory metric over 10 hours; past, predicted, and future values.]

Metrics prediction

Example

[Figure: disk partition usage over 20 hours against the warning threshold; the predicted values cross it, raising the alert "disk full in 10 hours". A sketch of this extrapolation follows.]
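A hedged sketch of how an alert such as "disk full in 10 hours" can be derived from the linear fit: extrapolate the trend and solve for the time at which it crosses the warning threshold. Function and parameter names are illustrative, not taken from the paper.

    def hours_until_threshold(slope, intercept, now, threshold):
        """Hours until value(t) = slope*t + intercept reaches threshold,
        or None if the current trend never reaches it."""
        if slope <= 0:  # flat or decreasing trend: threshold not reached
            return None
        t_cross = (threshold - intercept) / slope
        return max(0.0, t_cross - now)

    # E.g. for the disk-partition figure: a rising fit crossing the warning
    # zone about 10 hours after the prediction point yields ~10.0 here.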

Evaluation

Setup

- Hardware: 4 servers (16–28 cores, 128–256 GB RAM)
- Dataset: replay of production data recorded at Coservit
- 424 206 metrics and 1.5 billion data points, monitored on 25 070 servers

Evaluation

CPU load and memory consumption

[Figure: CPU and memory usage over time on (a) the master and (b) slave-1, running on 4 machines and 100 cores for 15 minutes.]

Evaluation

Time repartition

[Figure: time (in ms) spent in each stage of predicting one metric: load, create dataframe, train, predict, save, publish.]

Evaluation

Load handling

- End-to-end prediction of 1 metric takes 1 second.
- One monitoring server (with 24 cores) can handle the load of 1440 metrics (at worst), which corresponds to 85 monitored servers on average (see the arithmetic below).
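These figures are mutually consistent with the setup numbers. The one-minute prediction cycle assumed below is not stated on the slide, so treat this as a plausible reading rather than the authors' own derivation:

    24 \text{ cores} \times 60\,\tfrac{\text{s}}{\text{min}} \times 1\,\tfrac{\text{metric}}{\text{core}\cdot\text{s}} = 1440\,\tfrac{\text{metrics}}{\text{min}}

    \frac{1440 \text{ metrics}}{424\,206 / 25\,070 \approx 16.9 \text{ metrics per server}} \approx 85 \text{ servers}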

Evaluation

Load handling: linear scaling

[Figure: number of metrics processed (x1000) in 15 minutes versus CPU cores, for 1, 2, and 3 slaves.]

Conclusion

Related work

Positioning

No published work exhibits the same system (an end-to-end system for monitoring metrics prediction, storage, and blacklisting).

Prediction models

- Hardware failures [CAS12]
- Capacity planning (e.g. Microsoft Azure [mic])
- Datacenter temperature (e.g. Thermocast [LLL+11])
- Monitoring metrics (e.g. Zabbix [zab], with manual tuning)

Conclusion

Future work

- Experiment with more complex ML algorithms
- Predictions on long-term global trends
- Link with the ticketing mechanism

Thanks! Questions?

Bibliography I

[CAS12] T. Chalermarrewong, T. Achalakul, and S. C. W. See. Failure prediction of data centers using time series and fault tree analysis. In 2012 IEEE 18th International Conference on Parallel and Distributed Systems, pages 794–799, Dec 2012.

[LLL+11] Lei Li, Chieh-Jan Mike Liang, Jie Liu, Suman Nath, Andreas Terzis, and Christos Faloutsos. Thermocast: A cyber-physical forecasting model for datacenters. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 1370–1378, New York, NY, USA, 2011. ACM.