

Online Metrics Prediction in Monitoring Systems

Matthieu Caneill, Noel De Palma, Ali Ait-Bachir, Bastien Dine, Rachid Mokhtari, Yagmur Gizem Cinar

April 16, 2018

IEEE INFOCOM 2018, The 8th International Workshop on Big Data and Cloud Performance (DCPerf'18)

Table of contents

1. Introduction

2. Metrics prediction

3. Evaluation

4. Conclusion


Introduction

System goal: anticipate failures

[Figure: monitoring data flows into a machine learning system.]

- Monitoring insights
- Failure prediction
- Infrastructure scaling
- More server uptime

Introduction

Desired properties

- Scalable infrastructure: at least up to a few servers (around 150 CPU cores)
- End-to-end fault tolerance: metrics can never be lost
- Performance: "fast" computation of metric predictions (low latency)

Introduction

Metrics

- Monitoring metric: an observation point on a server in a datacenter
- System metrics: CPU load, memory, open sockets
- Service metrics: database status, web server response time
- Reported by agents, processed, and stored
- Computed as time series
- Associated with thresholds: warning and critical (an illustrative data shape follows)
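To make the metric model concrete, here is a minimal Python sketch of one metric, its warning and critical thresholds, and its time series of samples. The class and field names are hypothetical, not taken from the paper.

    # Hypothetical shape for a monitored metric; names are illustrative only.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Metric:
        name: str        # e.g. "cpu_load" or "web_response_time"
        host: str        # server reporting the metric
        warning: float   # warning threshold
        critical: float  # critical threshold
        # Time series of (timestamp, value) samples reported by the agent.
        samples: List[Tuple[float, float]] = field(default_factory=list)

        def zone(self, value: float) -> str:
            """Classify a value against the warning/critical thresholds."""
            if value >= self.critical:
                return "critical"
            if value >= self.warning:
                return "warning"
            return "ok"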

Metrics prediction

Metrics behaviour: 6 scenarios

[Figure: metric value over time against the warning and critical zones, illustrating six scenarios: quick rise, slow rise, transient rise, perplexity point, slow rise, and quick rise.]

Metrics prediction

Linear regression

[Figure: metric value over time, with a fitted regression line.]

- Ability to identify local trends (a few hours)
- Fast to compute
- Good candidate to avoid false positives (peaks)
- Library: MLlib (part of Apache Spark); a minimal fitting sketch follows
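As a concrete illustration of the approach, here is a minimal sketch (not the authors' code) of fitting a local trend to one metric's recent samples with MLlib's LinearRegression; the sample values are made up.

    # Minimal sketch: fit value(t) = slope * t + intercept with Spark MLlib.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("metric-trend").getOrCreate()

    # One (timestamp, value) pair per observation of a single metric.
    samples = [(0.0, 2.31), (1.0, 2.35), (2.0, 2.40), (3.0, 2.44),
               (4.0, 2.51), (5.0, 2.55), (6.0, 2.58), (7.0, 2.63)]
    df = spark.createDataFrame(samples, ["t", "value"])

    # MLlib regressors expect a vector column of features; here the only
    # feature is time.
    assembler = VectorAssembler(inputCols=["t"], outputCol="features")
    model = LinearRegression(featuresCol="features", labelCol="value") \
        .fit(assembler.transform(df))

    slope, intercept = model.coefficients[0], model.intercept
    print(f"trend: value(t) = {slope:.4f} * t + {intercept:.4f}")

Because the fit is over a short window (a few hours), a single transient peak barely moves the slope, which is what makes the method a good candidate against false positives.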

Metrics prediction

System architecture

[Figure: monitoring agents send metrics to a monitoring broker; samples are stored in a Cassandra database and processed by Spark + MLlib; predictions feed an alert manager, a GUI, ...]
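The slides do not name the broker or show code for this path; purely as an illustration, the sketch below wires a Kafka topic (standing in for the monitoring broker) to Cassandra using the kafka-python and cassandra-driver libraries. Topic, keyspace, and table names are all hypothetical.

    # Hypothetical glue for the ingestion path (broker -> Cassandra).
    # Kafka stands in for the unnamed monitoring broker; names are made up.
    import json
    from kafka import KafkaConsumer          # pip install kafka-python
    from cassandra.cluster import Cluster    # pip install cassandra-driver

    consumer = KafkaConsumer(
        "metrics",                           # hypothetical topic
        bootstrap_servers="broker:9092",
        value_deserializer=lambda b: json.loads(b))

    session = Cluster(["cassandra-host"]).connect("monitoring")
    insert = session.prepare(
        "INSERT INTO samples (metric_id, ts, value) VALUES (?, ?, ?)")

    # Each message carries one (metric, timestamp, value) sample.
    for msg in consumer:
        m = msg.value
        session.execute(insert, (m["metric_id"], m["ts"], m["value"]))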

Metrics prediction

Metric blacklisting

- Some metrics are too volatile and hard to predict
- To avoid false positives/negatives, and to save resources, they are blacklisted
- Root Mean Square Error (RMSE) evaluated weekly
- Metrics are (temporarily) blacklisted if their RMSE > threshold (a sketch of this check follows)
- 58.5% of the metrics have a low RMSE → good predictions
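A minimal sketch (assumed, not the authors' code) of the weekly blacklisting pass: compare each metric's past predictions against the values actually observed, and blacklist the metrics whose RMSE exceeds a threshold. The threshold value below is illustrative; the slides do not publish it.

    import math

    RMSE_THRESHOLD = 5.0  # illustrative; not from the paper

    def rmse(predicted, observed):
        """Root Mean Square Error between two equally long series."""
        assert len(predicted) == len(observed) and predicted
        return math.sqrt(
            sum((p - o) ** 2 for p, o in zip(predicted, observed))
            / len(predicted))

    def weekly_blacklist(history):
        """history: {metric_id: (predicted_series, observed_series)}."""
        return {m for m, (pred, obs) in history.items()
                if rmse(pred, obs) > RMSE_THRESHOLD}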

Metrics prediction

Example

[Figure: swap memory metric over 10 hours; past values up to the point where the prediction is computed, the predicted values, and the actual future values.]

Metrics prediction

Example

[Figure: physical memory metric over 10 hours; past, predicted, and future values.]

Metrics prediction

Example

[Figure: disk partition usage over 20 hours against the warning threshold; the predicted values cross it, raising the alert "disk full in 10 hours". A sketch of this extrapolation follows.]
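A hedged sketch of how an alert such as "disk full in 10 hours" can be derived from the linear fit: extrapolate the trend and solve for the time at which it crosses the warning threshold. Function and parameter names are illustrative, not taken from the paper.

    def hours_until_threshold(slope, intercept, now, threshold):
        """Hours until value(t) = slope*t + intercept reaches threshold,
        or None if the current trend never reaches it."""
        if slope <= 0:  # flat or decreasing trend: threshold not reached
            return None
        t_cross = (threshold - intercept) / slope
        return max(0.0, t_cross - now)

    # E.g. for the disk-partition figure: a rising fit crossing the warning
    # zone about 10 hours after the prediction point yields ~10.0 here.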

Evaluation

Setup

- Hardware: 4 servers (16–28 cores, 128–256 GB RAM)
- Dataset: replay of production data recorded at Coservit
- 424 206 metrics and 1.5 billion data points, monitored on 25 070 servers

Evaluation

CPU load and memory consumption

[Figure: CPU and memory usage over time on (a) the master and (b) slave-1, running on 4 machines and 100 cores for 15 minutes.]

Evaluation

Time repartition

[Figure: time (in ms) spent in each stage of predicting one metric: load, create dataframe, train, predict, save, publish.]

Evaluation

Load handling

- End-to-end prediction of 1 metric takes 1 second.
- One monitoring server (with 24 cores) can handle the load of 1440 metrics (at worst), which corresponds to 85 monitored servers on average (see the arithmetic below).
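These figures are mutually consistent with the setup numbers. The one-minute prediction cycle assumed below is not stated on the slide, so treat this as a plausible reading rather than the authors' own derivation:

    24 \text{ cores} \times 60\,\tfrac{\text{s}}{\text{min}} \times 1\,\tfrac{\text{metric}}{\text{core}\cdot\text{s}} = 1440\,\tfrac{\text{metrics}}{\text{min}}

    \frac{1440 \text{ metrics}}{424\,206 / 25\,070 \approx 16.9 \text{ metrics per server}} \approx 85 \text{ servers}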

Evaluation

Load handling: linear scaling

[Figure: number of metrics processed (x1000) in 15 minutes versus CPU cores, for 1, 2, and 3 slaves.]

Conclusion

Related work

Positioning

No published work exhibits the same system (an end-to-end system for monitoring metrics prediction, storage, and blacklisting).

Prediction models

- Hardware failures [CAS12]
- Capacity planning (e.g. Microsoft Azure [mic])
- Datacenter temperature (e.g. Thermocast [LLL+11])
- Monitoring metrics (e.g. Zabbix [zab], with manual tuning)

Conclusion

Future work

- Experiment with more complex ML algorithms
- Predictions on long-term global trends
- Link with the ticketing mechanism

Thanks! Questions?

Bibliography I

[CAS12] T. Chalermarrewong, T. Achalakul, and S. C. W. See. Failure prediction of data centers using time series and fault tree analysis. In 2012 IEEE 18th International Conference on Parallel and Distributed Systems, pages 794–799, Dec 2012.

[LLL+11] Lei Li, Chieh-Jan Mike Liang, Jie Liu, Suman Nath, Andreas Terzis, and Christos Faloutsos. Thermocast: A cyber-physical forecasting model for datacenters. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 1370–1378, New York, NY, USA, 2011. ACM.