Online Metrics Prediction in Monitoring Systems

Matthieu Caneill, Noël De Palma, Ali Ait-Bachir, Bastien Dine, Rachid Mokhtari, Yagmur Gizem Cinar

April 16, 2018

IEEE INFOCOM 2018, The 8th International Workshop on Big Data in Cloud Performance (DCPerf’18)

Table of contents

1. Introduction

2. Metrics prediction

3. Evaluation

4. Conclusion


Introduction

System goal: anticipate failures

monitoring data → machine learning system

- Monitoring insights
- Failure prediction
- Infrastructure scaling
- More server uptime


Desired properties

- Scalable infrastructure: at least up to a few servers (around 150 CPU cores)
- End-to-end fault tolerance: metrics can never be lost
- Performance: metric predictions must be “fast” to compute (low latency)


Metrics

- Monitoring metric: observation point on a server in a datacenter
- System metrics: CPU load, memory, open sockets
- Service metrics: database status, web server response time
- Reported by agents, processed, and stored
- Computed as time series
- Associated with thresholds: warning and critical


Metrics prediction

Metrics behaviour: 6 scenarios

Figure: metric value over time, shown against a warning zone and a critical zone. Curve labels: quick rise, slow rise, transient rise, perplexity point, slow rise, quick rise.


Linear regression

Figure: linear fit of metric value against time.

- Ability to identify local trends (a few hours)
- Fast to compute
- Good candidate to avoid false positives (peaks)
- Library: MLlib (part of Apache Spark)
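To make the idea concrete, here is a minimal sketch of fitting a line to a short window of readings and extrapolating it. It uses plain-Python ordinary least squares rather than MLlib, and the function names (`fit_line`, `predict`) and the sample readings are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def fit_line(times: List[float], values: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of value = slope * time + intercept."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (time, value) pairs of equal length")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all timestamps are identical")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope, intercept


def predict(slope: float, intercept: float, t: float) -> float:
    """Extrapolate the fitted line to time t."""
    return slope * t + intercept


# Hypothetical hourly readings with a transient peak at hour 3.
hours = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
readings = [2.30, 2.32, 2.34, 2.60, 2.38, 2.40]

slope, intercept = fit_line(hours, readings)
forecast = predict(slope, intercept, 8.0)  # three hours past the last reading
```

Because the fit averages over the whole window, the single peak at hour 3 raises the slope only slightly, and the forecast stays below the peak value. That is one way to read the "avoid false positives (peaks)" point above.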


System architecture

Diagram components: monitoring agents, monitoring broker, Cassandra database, Spark + MLlib, alert manager, GUI, ...


Metric blacklisting

- Some metrics are too volatile and hard to predict
- To avoid false positives/negatives and save resources, they are blacklisted
- Root Mean Square Error (RMSE) evaluated weekly
- Metrics (temporarily) blacklisted if their RMSE > threshold
- 58.5% of the metrics have a low RMSE → good predictions
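The blacklisting rule can be sketched as follows. This is an illustrative reading of the bullets, not the authors' implementation: the function names, the metric names, and the threshold value are hypothetical, and the slides do not say whether the threshold is global or normalized per metric.

```python
import math
from typing import Dict, List, Set


def rmse(predicted: List[float], observed: List[float]) -> float:
    """Root mean square error between predicted and observed values."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


def blacklist(errors: Dict[str, float], threshold: float) -> Set[str]:
    """Return the metrics whose RMSE exceeds the threshold."""
    return {name for name, err in errors.items() if err > threshold}


# Hypothetical weekly evaluation: predictions vs. what was later observed.
weekly = {
    "srv1.swap": ([2.4, 2.5, 2.6], [2.4, 2.5, 2.7]),
    "srv1.open_sockets": ([10.0, 12.0, 11.0], [40.0, 3.0, 25.0]),
}
errors = {name: rmse(pred, obs) for name, (pred, obs) in weekly.items()}
blacklisted = blacklist(errors, threshold=1.0)  # assumed threshold
```

Since the evaluation is repeated weekly, a metric that becomes predictable again would drop out of the set on the next run, which matches the "(temporarily)" qualifier.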


Example

Figure: swap memory. Metric value (axis 2.3 to 2.7) against time (0 to 10 hours), with past, predicted, and future series; an annotation marks the point at which the prediction is computed.


Example

Figure: physical memory. Metric value (axis 2 to 2.4) against time (0 to 10 hours), with past, predicted, and future series.


Example

Figure: disk partition. Metric value (axis 688 to 696) against time (0 to 20 hours), with past, predicted, and future series and a Warning level marked. Annotation: raise alert: disk full in 10 hours.
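An alert such as "disk full in 10 hours" amounts to asking when a fitted line will reach a threshold. The sketch below assumes a line has already been fitted (slope and intercept); the function name and all numbers are hypothetical and are not read from the figure.

```python
from typing import Optional


def hours_until_threshold(slope: float, intercept: float,
                          now: float, threshold: float) -> Optional[float]:
    """Hours from `now` until the line slope * t + intercept reaches `threshold`.

    Returns 0.0 if the line is already at or above the threshold,
    and None if the trend is flat or decreasing (it never gets there).
    """
    current = slope * now + intercept
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    crossing_time = (threshold - intercept) / slope
    return crossing_time - now


# Hypothetical fit: usage grows by 0.5 units per hour from 688 at t = 0.
remaining = hours_until_threshold(slope=0.5, intercept=688.0,
                                  now=6.0, threshold=696.0)
if remaining is not None:
    message = f"raise alert: threshold reached in {remaining:.0f} hours"
```

Returning `None` for a flat or falling trend keeps the caller from raising an alert on a metric that is not heading towards its threshold.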


Evaluation

Setup

- Hardware: 4 servers (16–28 cores, 128–256 GB RAM)
- Dataset: replay of production data recorded at Coservit
- 424 206 metrics, 1.5 billion data points monitored on 25 070 servers


CPU load and memory consumption

Figure: Running on 4 machines and 100 cores for 15 minutes. Panels (a) master and (b) slave-1 plot CPU and memory usage (axis 0% to 120%) against time (0 to 800 seconds).


Time breakdown

Figure: Time breakdown for predicting a metric. Bar chart of time (ms, axis 0 to 500) per pipeline step; the steps include load, create dataframe, train, predict, save, and publish.


Load handling

- End-to-end process for the prediction of 1 metric: 1 second.
- One monitoring server (with 24 cores) can handle the load of 1440 metrics (at worst), which corresponds to 85 servers on average.
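The figures above can be reproduced with a short calculation. The one-minute prediction period is an assumption: the slides do not state it, but it is the period under which 24 cores at 1 second per metric give exactly 1440 metrics. The metric and server totals come from the evaluation dataset.

```python
SECONDS_PER_PREDICTION = 1.0  # from the slide: 1 metric takes 1 second
CORES = 24                    # from the slide: one 24-core monitoring server
PERIOD_SECONDS = 60.0         # ASSUMPTION: each metric is re-predicted every minute

# Each core handles one prediction at a time, so capacity per period is:
metrics_per_period = int(CORES * PERIOD_SECONDS / SECONDS_PER_PREDICTION)

TOTAL_METRICS = 424_206       # from the dataset description
TOTAL_SERVERS = 25_070        # from the dataset description

metrics_per_server = TOTAL_METRICS / TOTAL_SERVERS       # roughly 17 per server
servers_covered = metrics_per_period / metrics_per_server

print(metrics_per_period, round(servers_covered))
```

Under that assumption, the capacity works out to 1440 metrics, and dividing by the average number of metrics per monitored server gives about 85 servers.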


Load handling: linear scaling

Figure: Number of metrics handled in 15 minutes. Processed metrics (×1000, axis 0 to 120) against CPU cores (0 to 140), with series for 1 slave, 2 slaves, and 3 slaves.


Conclusion

Related work

Positioning

No published work exhibits the same system (end-to-end system for monitoring metrics prediction, storage and blacklisting).

Prediction models

- Hardware failures [CAS12]
- Capacity planning (e.g. Microsoft Azure [mic])
- Datacenter temperature (e.g. Thermocast [LLL+11])
- Monitoring metrics (e.g. Zabbix [zab] with manual tuning)


Future work

- Experiment with more complex ML algorithms
- Predictions on long-term global trends
- Link with ticketing mechanism


Thanks! Questions?


Bibliography I

[CAS12] T. Chalermarrewong, T. Achalakul, and S. C. W. See. Failure prediction of data centers using time series and fault tree analysis. In 2012 IEEE 18th International Conference on Parallel and Distributed Systems, pages 794–799, Dec 2012.

[LLL+11] Lei Li, Chieh-Jan Mike Liang, Jie Liu, Suman Nath, Andreas Terzis, and Christos Faloutsos. Thermocast: A cyber-physical forecasting model for datacenters. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11, pages 1370–1378, New York, NY, USA, 2011. ACM.
