
Thermal Prediction for Efficient Energy Management of Clouds using Machine Learning

Shashikant Ilager, Kotagiri Ramamohanarao, Rajkumar Buyya

Abstract—Thermal management in hyper-scale cloud data centers is a critical problem. Increased host temperature creates hotspots which significantly increase cooling cost and affect reliability. Accurate prediction of host temperature is crucial for managing the resources effectively. Temperature estimation is a non-trivial problem due to thermal variations in the data center. Existing solutions for temperature estimation are inefficient due to their computational complexity and lack of accurate prediction. However, data-driven machine learning methods for temperature prediction are a promising approach. In this regard, we collect and study data from a private cloud and show the presence of thermal variations. We investigate several machine learning models to accurately predict the host temperature. Specifically, we propose a gradient boosting machine learning model for temperature prediction. The experimental results show that our model accurately predicts the temperature with an average RMSE value of 0.05, or an average prediction error of 2.38 °C, which is 6 °C less than an existing theoretical model. In addition, we propose a dynamic scheduling algorithm to minimize the peak temperature of hosts. The results show that our algorithm reduces the peak temperature by 6.5 °C and consumes 34.5% less energy compared to the baseline algorithm.

Index Terms—Cloud computing, Machine learning, Energy efficiency in a data center, Data center cooling, Hotspots


1 INTRODUCTION

The transition from ownership-based on-premise IT infrastructure to subscription-based Cloud has been tremendous in the past decade due to the vast advantages that cloud computing offers [1]. This rapid proliferation of the cloud has resulted in a massive number of hyper-scale data centers that generate an exorbitant amount of heat and consume a large amount of electrical energy. According to [2], around 2% of the globally generated electricity is spent on data centers, and almost 50% of this energy is spent on cooling systems [3].

Modern cloud data centers' rack-mounted servers can consume up to 1000 watts of power and attain peak temperatures as high as 100 °C [4]. The power consumed by a host is dissipated as heat to the ambient environment, and the cooling system is equipped to remove this heat and keep the host's temperature below the threshold. Increased host temperature is a bottleneck for the normal operation of a data center as it escalates the cooling cost. It also creates hotspots that severely affect the reliability of the system due to cascading failures caused by silicon component damage. The report from the Uptime Institute [5] shows that the failure rate of equipment doubles for every 10 °C increase above 21 °C. Hence, thermal management becomes a crucial process inside the data center Resource Management System (RMS).

Therefore, to minimize the risk of peak temperature repercussions, and to reduce a significant amount of energy consumption, ideally we need accurate predictions of thermal dissipation and power consumption of hosts based on

• Shashikant Ilager, Kotagiri Ramamohanarao and Rajkumar Buyya are with the Cloud Computing and Distributed Systems (CLOUDS) Laboratory, School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC 3010, Australia. (e-mail: [email protected], {kotagiri, rbuyya}@unimelb.edu.au).

workload level, and a scheduler that efficiently schedules the workloads using these accurate predictions through some scheduling policies. However, accurate prediction of a host temperature in a steady-state data center is a non-trivial problem [6], [7]. This is extremely challenging due to the complex and discrepant thermal behavior associated with computing and cooling systems. Such variations in a data center are usually enforced by CPU frequency throttling mechanisms guided by Thermal Design Power (TDP), attributes associated with hosts such as their physical location and distance from the cooling source, and also thermodynamic effects like heat recirculation [6], [7]. Hence, the estimation of the host temperature in the presence of such discrepancies is vital for efficient thermal management. Sensors are deployed at both the CPU and rack level to sense the CPU and ambient temperature, respectively. However, these sensors only report the current thermal status. Consequently, predicting future temperature based on changes in workload level is equally necessary for critically important RMS tasks such as resource provisioning, scheduling, and setting the cooling system parameters.

Existing approaches to predict the temperature are inaccurate, complex, or computationally expensive. The widely used theoretical analytical models [6], [7], [8], [9], [10], which are built on mathematical relations between different cyber-physical components, lack scalability and accurate prediction of the actual temperature. In addition, theoretical models fail to consider several variables that contribute to temperature behavior, and they need to be changed for different data centers. Computational Fluid Dynamics (CFD) models are also predominantly used [11], [12] for accurate predictions, but their high complexity requires a large number of computing cycles. Building these CFD models and executing them can take hours or days, depending on individual data center complexity [13]. The CFD models


are useful in the initial design and calibration of data center layout and cooling settings; however, they are infeasible for real-time tasks such as scheduling in large-scale clouds that are dynamic and require quick online decisions. Moreover, CFD simulation requires both computational (e.g., layout of the data center, open tiles) and physical parameters, and changes to these parameters need expensive retraining of the models [14]. In contrast, our approach is fast and cost-effective as it solely relies on the physical sensor data that are readily available on any rack-mounted server and implicitly captures variations. Hence, data-driven methods using machine learning techniques are a promising approach to predict the host temperature quickly and accurately.

Machine learning (ML) techniques have become pervasive in modern digital society, mainly in computer vision and natural language processing applications. With the advancement in machine learning algorithms and the availability of sophisticated tools, applying these ML techniques to optimize large-scale computing systems is a propitious avenue [15], [16], [17], [18]. Recently, Google reported a list of their efforts in this direction [19], where they optimize several of their large-scale computing systems using ML to reduce cost and energy and increase performance. Data-driven temperature predictions are highly suitable as they are built from actual measurements and capture the important variations that are induced by different factors in data center environments. Furthermore, recent works have explored ML techniques to predict the data center host temperature [6], [20]. However, these works are applied to HPC data centers or similar infrastructure and rely on both application and physical level features to train the models. In addition, they provide application-specific temperature estimation. Nevertheless, the presence of the virtualization layer in infrastructure clouds prohibits this application-specific approach due to the isolated execution environment provided to users. Moreover, getting access to application features is impractical in clouds because of privacy and security agreements between users and cloud providers. Consequently, we present a host temperature prediction model that completely relies on features that can be directly accessed from physical hosts and is independent of application counters.

In this regard, we collect and study data from our University's private research cloud. We propose data-driven temperature prediction models based on this collected data. We use this data to build ML-based models that can be used to predict the temperature of hosts at runtime. Accordingly, we investigated several ML algorithms including variants of regression models, a neural network model, namely the Multilayer Perceptron (MLP), and ensemble learning models. Based on the experimental results, the ensemble-based gradient boosting method, specifically XGBoost [21], is chosen for temperature prediction. The proposed prediction model has high accuracy with an average prediction error of 2.5 °C and a Root Mean Square Error (RMSE) of 0.05. Furthermore, guided by these prediction models, we propose a dynamic scheduling algorithm to minimize the peak temperature of hosts in a data center. The scheduling algorithm is evaluated based on real-world workload traces and is capable of circumventing potential hotspots and significantly reduces the total energy consumption of a data

Fig. 1: Feature set correlation and temperature distribution. (a) Correlation between all features; (b) CPU temperature distribution.

center. The results have demonstrated the feasibility of our proposed prediction models and scheduling algorithm in data center RMS.

In summary, the key contributions of our work are:

• We collect physical-host level measurements from a real-world data center and show the thermal and energy consumption variations between hosts under similar resource consumption and cooling settings.

• We build machine learning based temperature prediction models using fine-grained measurements from the collected data.

• We show the accuracy and the feasibility of the proposed prediction models with extensive empirical evaluation.

• We propose a dynamic workload scheduling algorithm guided by the prediction methods to reduce the peak temperature of the data center that minimizes the total energy consumption under rigid thermal constraints.

The remainder of the paper is organized as follows: The motivations for this work and thermal implications in the cloud are explained in Section 2. Section 3 proposes a thermal prediction framework and explores different ML algorithms. Section 4 describes the gradient boosting based prediction model. The feasibility of the prediction model is evaluated against the theoretical model in Section 5. Section 6 presents a dynamic scheduling algorithm. The analysis of the scheduling algorithm results is done in Section 7, and the feature set analysis is described in Section 8. The relevant literature for this work is discussed in Section 9. Finally, Section 10 concludes the paper and also draws the future directions.

2 MOTIVATION: INTRICACIES IN CLOUD DATA CENTER'S THERMAL MANAGEMENT

Thermal management is a critical component in cloud data center operations. The presence of multi-tenant users and their heterogeneous workloads exhibits non-coherent behavior with respect to the thermal and power consumption of hosts in a cloud data center. Reducing even one degree of temperature in cooling saves millions of dollars per year in large-scale data centers [17]. In addition, most data centers and servers are already equipped with monitoring infrastructure that has several sensors to read


the workload, power, and thermal parameters. Using this data to predict the temperature is cost-effective and feasible. To that end, to analyze the complex relationships between different parameters that influence the host temperature, we collected data from a private cloud and studied it for intrinsic information. This data includes resource usage and sensor data of power, thermal, and fan speed readings of hosts. Detailed information about the data and collection method is described in Section 3.2.

The correlation between different parameters (Table 1) and the temperature distribution in the data center can be observed in Figures 1a and 1b. These figures are drawn from data including records of 75 hosts collected over a 90-day period with a logging interval of 10 minutes (i.e., 75 × 90 × 24 × 6 records). The correlation plot in Figure 1a is based on the standard pairwise Pearson correlation coefficient, represented as a heat map. Here, the correlation value ranges from -1 to 1, where the value is close to 1 for highly correlated features, 0 for no correlation, and -1 for negative correlation. For better illustration, the values are represented as color shades as shown in the figure. In addition, the correlation matrix is clustered based on pairwise Euclidean distance to enhance interpretability. It is evident that the CPU temperature of a host is highly influenced by power consumption and CPU load. However, factors like memory usage and machine fan speeds also have some degree of interdependence with it. Additionally, inlet temperature has a positive correlation with fan speeds and the number of VMs running on a host.
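For illustration, a correlation matrix like the one in Figure 1a, clustered by pairwise Euclidean distance, can be computed with pandas and scipy roughly as follows. This is a sketch, not the authors' code; the DataFrame column names (from Table 1) and the "average" linkage method are assumptions.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist

def clustered_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlation matrix, with rows/columns reordered
    by hierarchical clustering on pairwise Euclidean distance so that
    strongly related features (e.g. Pc and Tcpu1) appear adjacent."""
    corr = df.corr(method="pearson")
    # Cluster the rows of the correlation matrix and read off leaf order
    order = leaves_list(
        linkage(pdist(corr.values, metric="euclidean"), method="average")
    )
    cols = corr.columns[order]
    return corr.loc[cols, cols]
```

The reordered matrix can then be rendered as a heat map with any plotting library.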

The high number of hosts operating at peak CPU temperature can be observed in Figure 1b. The figure represents a histogram of the temperature distribution of all hosts. Here, the bins on the x-axis represent CPU temperature and the y-axis represents probability density. The CPU temperature of hosts can reach more than 80 °C, and such conditions occur frequently, as evidenced by the high density value on the y-axis for the respective bin. In addition, hosts exhibit inconsistent thermal behavior based on several factors. This non-linear behavior of hosts presents a severe challenge in temperature estimation. A single theoretical mathematical model, applied even to homogeneous nodes, fails to accurately predict the temperature. Two homogeneous nodes at a similar CPU load observe different CPU temperatures. For instance, at a CPU load of 50% on different hosts in our data set, CPU temperature varies by up to 14 °C. Furthermore, with similar cooling settings, inlet temperature also varies by up to 9 °C between hosts. These temperature variations are caused by factors like physical attributes such as the host's location, thermodynamic effects, heat recirculation, and thermal throttling mechanisms induced by the operating system based on workload behaviors [6]. Therefore, a temperature estimation model should consider these non-linear composite relationships between hosts.

Motivated by these factors, we rely on data-driven prediction approaches instead of the existing rigid analytical and expensive CFD-based methods. We use the collected data to build prediction models that accurately estimate the host temperature. Furthermore, guided by these prediction models, we propose a simple dynamic scheduling algorithm to minimize the peak temperature in

Fig. 2: System model

the data center.

3 SYSTEM MODEL AND DATA-DRIVEN TEMPERATURE PREDICTION

In this section, we describe the system model and discuss methods and approaches for cloud data center temperature prediction. We use these methods to further optimize our prediction model in Section 4.

3.1 System Model

A system model for predictive thermal management in the cloud data center is shown in Figure 2. The Resource Management System (RMS) at the middleware interacts with users and the thermal prediction module to efficiently manage the underlying resources in the cloud infrastructure. The prediction module consists of four main components, i.e., data collection, training the suitable model, validating the performance of the model, and finally deploying it for runtime usage. The RMS in a data center can use these deployed models to efficiently manage the resources and reduce the cost. The important elements of the framework are discussed in the following subsections.
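The four stages of the prediction module (collect, train, validate, deploy) can be sketched as a minimal pipeline skeleton. This is illustrative only: the stage names, callables, and the deployment threshold are our assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ThermalPredictionPipeline:
    """Skeleton of the prediction module in Figure 2: collect data,
    train a model, validate it, then deploy it for runtime use by the RMS.
    Stage implementations are supplied by the caller."""
    collect: Callable[[], object]
    train: Callable[[object], object]
    validate: Callable[[object, object], float]

    def build(self, threshold: float):
        data = self.collect()
        model = self.train(data)
        error = self.validate(model, data)
        # Deploy (return) the model only if its validation error is acceptable
        return model if error <= threshold else None
```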

3.2 Data Collection

An ML-based prediction model is only as good as the data used to train it. In the data center domain, training data can include application and physical level features to train the model [6]. The application features include instruction count, number of CPU cycles, cache metrics (read, write, and miss), etc. Accordingly, physical features include host-level resource usage (CPU, RAM, I/O, etc.) and several sensor readings (power, CPU temperature, fan speeds). Relying on both of these features is feasible in bare-metal HPC data centers where administrators have exclusive access to the application and physical features. However, in the case of Infrastructure as a Service (IaaS) clouds, resources are


TABLE 1: Definition of features collected

Feature   Definition
CPU       CPU load (%)
R         RAM - Random Access Memory (MB)
Rx        RAM in use (MB)
NCPU      Number of CPU cores
NCPUx     Number of CPU cores in use
NRx       Network inbound traffic (Kbps)
NTx       Network outbound traffic (Kbps)
Pc        Power consumed by host (watts)
Tcpu1     CPU 1 temperature (°C)
Tcpu2     CPU 2 temperature (°C)
fs1       Fan 1 speed (RPM)
fs2       Fan 2 speed (RPM)
fs3       Fan 3 speed (RPM)
fs4       Fan 4 speed (RPM)
Tin       Inlet temperature (°C)
Nvm       Number of VMs running on host

virtualized and provisioned as VMs or containers, thus giving users exclusive isolated access to the application execution environment. The presence of a hypervisor or container-based virtualization in IaaS clouds restricts access to application-specific features. Moreover, the diverse set of users in the cloud have different types of workloads exhibiting different application behaviors, which impedes cloud RMS from relying on application-specific features. As a consequence, to predict host temperature, it is required to make use of more fine-grained resource usage and physical features of the host system that can be directly accessed from the physical hosts, without relying on application features. In this regard, we show that this data is adequate to predict the host temperature accurately.

The representative data is collected from the University of Melbourne's private research cloud¹. This computing infrastructure provides computing facilities to students and researchers as virtual machines (VMs). We collect data from a subset of the total machines in this cloud. Brief information about this data is included in Table 3. It includes logs of 75 physical hosts running an average of 650 VMs. The data is recorded for a period of 3 months with the log interval set to 10 minutes. The total count of resources includes 9600 CPU cores and 38692 GB of memory. After data filtration and cleaning, the final dataset contains 984712 tuples, each host having approximately 13000 tuples. Each tuple contains 16 features including resource and usage metrics, and power, thermal, and fan speed sensor measurements. The details of these features are given in Table 1. Since the hosts have two CPUs, we have two CPU temperature measurements. In addition, the hosts have four onboard fans for cooling. The reason for collecting data over a long period is to capture all the dynamics and variations of parameters to train the model effectively. This is only possible when host resources have experienced different usage levels over time; a model built over such data allows accurate prediction in dynamic workload conditions. An overview of the variations of all these parameters is given in Table 2 (NCPU and R are not included as they represent constant resource capacity).

To collect this data, we use the collectd² daemon running on

1. https://tinyurl.com/y2q9o9vc
2. https://collectd.org/

all hosts in the data center; collectd is a standard open-source application that collects system and application performance counters periodically through system interfaces such as IPMI and sensors. These metrics are accessed through network APIs and stored on a centralized server in CSV format. We used several bash and python scripts to pre-process the data, specifically the python pandas³ package to clean and sanitize the data. The missing entries and entries with NaN values are removed. For the broader use of this data by the research community and for the sake of reproducibility, we will publish the data and scripts used in this work.
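The filtering step described above can be sketched with pandas as follows. This is a minimal sketch: the real export's column layout is not specified here, and the function simply drops incomplete and duplicated records as the text describes.

```python
import pandas as pd

def clean_dataset(source) -> pd.DataFrame:
    """Load raw collectd metrics exported as CSV and drop incomplete
    records. `source` may be a file path or any file-like object
    accepted by pandas.read_csv."""
    df = pd.read_csv(source)
    df = df.dropna()           # remove rows with missing or NaN entries
    df = df.drop_duplicates()  # guard against duplicated log records
    return df.reset_index(drop=True)
```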

3.3 Prediction Algorithms

The choice of regression-based algorithms for our problem is natural since our aim is to estimate a numerical output variable, i.e., temperature. In this regard, we focus on suitable candidate algorithms. We have explored different ML algorithms including different forms of regression techniques: Linear Regression (LR), Bayesian Regression (BR), Lasso Linear Regression (LLR), Stochastic Gradient Descent regression (SGD), an Artificial Neural Network (ANN) model called the Multilayer Perceptron (MLP), and lastly an ensemble learning technique called gradient boosting, specifically eXtreme Gradient Boosting (XGBoost).
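The candidate set above can be instantiated with scikit-learn roughly as follows. This is a sketch under assumptions: hyperparameters are defaults (the paper's exact settings appear later), and scikit-learn's GradientBoostingRegressor stands in for XGBoost so the sketch needs no extra dependency; the paper itself uses the xgboost package.

```python
from sklearn.linear_model import (BayesianRidge, Lasso,
                                  LinearRegression, SGDRegressor)
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor

def candidate_models():
    """The regression candidates explored in the text, keyed by the
    abbreviations used in the paper."""
    return {
        "LR": LinearRegression(),
        "BR": BayesianRidge(),
        "LLR": Lasso(alpha=0.1),
        "SGD": SGDRegressor(max_iter=1000),
        # 3-layer MLP with 5 hidden neurons, as described in Section 3.3
        "MLP": MLPRegressor(hidden_layer_sizes=(5,), activation="relu"),
        # stand-in for xgboost.XGBRegressor in this sketch
        "XGBoost": GradientBoostingRegressor(),
    }
```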

Our aim is to build a model for each host to accurately capture its thermal behavior, as temperature varies largely between hosts due to the many factors that we have already discussed. For that reason, instead of solely predicting CPU temperature, we predict the host ambient temperature (Tamb), which is a combination of inlet temperature and CPU temperature [22]. Since our hosts have dual CPUs, we consider the maximum of the two CPU temperatures as the CPU temperature. The reasons to consider ambient temperature instead of CPU temperature are manifold. First, by combining the inlet and CPU temperature, it is feasible to capture thermal variations that are induced by both the inlet and CPU temperature (the causes of these variations are discussed in Section 2). Second, at a data center level, cooling settings knobs are adjusted based on host ambient temperature rather than individual CPU temperature [13]. In addition, resource management systems in the data center consider host-level ambient temperature as a threshold parameter, whereas operating-system-level resource management techniques rely on CPU temperature.

Therefore, to build the prediction model for individual hosts, we parse the data set and partition it based on host IDs. For each individual host, the data set now includes a number of instances, with each instance having these features (CPU, R, Rx, NCPU, NCPUx, NRx, NTx, Nvm, Pc, fs1-fs4, Tamb). Note that we have excluded inlet and CPU temperatures from the list, as we have combined these into the ambient temperature (Tamb), which is our target prediction variable.

We used the scikit-learn package [23] to implement all the algorithms, and for XGBoost, we used the standard python package⁴ that is available on GitHub. The parameters for

3. https://pandas.pydata.org/
4. https://github.com/dmlc/xgboost


TABLE 2: Description of the feature set variations in the dataset (aggregated from all the hosts)

      CPU(%)  Rx         NRx        NTx        Nvm  NCPUx  Pc      fs2    fs1    fs3    fs4    Tcpu1  Tcpu2  Tin
Min   0       3974       0          0          0    0      55.86   5636   5686   5688   5645   29.14  25.46  13.33
Max   64.74   514614     583123.08  463888.76  21   101    380.53  13469  13524  13468  13454  82     75.96  18.05
Mean  18.09   307384.48  2849.00    1354.164   9    54     222.73  9484   9501   9490   9480   59.50  50.78  25.75

TABLE 3: Private cloud data collected for this work

#Hosts  #VMs  Total CPU Cores  Total Memory  Collection Period  Collection Interval
75      650   9600             38692 GB      90 days            10 Minutes

Fig. 3: Average prediction error (RMSE) between different models (Regression, Regression_BR, Regression_SGD, Regression_Ridge, MLP, XGBoost)

each of the algorithms are self-explanatory in our implementation; for MLP, we follow a standard 3-layer architecture, with the number of neurons in the hidden layer set to 5 and a single output neuron, using 'ReLU' as the activation function.

To avoid overfitting of the models, we adopt k-fold cross-validation where the value of k is set to 10. Furthermore, to evaluate the goodness of fit of the different models, we use the Root Mean Square Error (RMSE) metric, which is a standard evaluation metric in regression-based problems [24]. The RMSE is defined as follows.

RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}    (1)

In Equation 1, y_i is the observed value, ŷ_i is the output variable predicted by the prediction model, and n is the total number of predictions. The value of RMSE represents the standard deviation of the residuals or prediction errors. It indicates how far the data points are from the model's fitted line; in other words, it is a measure of how spread out the residuals are. Thus, lower RMSE values are preferred.
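Equation 1 translates directly into a few lines of numpy (a straightforward transcription of the formula, not code from the paper):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error, as defined in Equation 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```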

The performance of the different algorithms is shown in Figure 3. These results are an average of all the hosts' prediction model results. In Figure 3, we can observe that XGBoost has a very low RMSE value, indicating that the residuals or prediction errors are small and its predictions are more accurate. We observed that MLP has a high error value compared to the other algorithms. In addition, the different regression variants performed almost similarly to each other. As the results of the gradient boosting method XGBoost are promising, we focus on this algorithm to explore it further, optimize it, and adapt it for the scheduling use case study explained in Section 6.
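The 10-fold cross-validated RMSE behind a comparison like Figure 3 can be sketched with scikit-learn as follows. The exact evaluation harness is not given in the paper; this sketch assumes scikit-learn's built-in `neg_root_mean_squared_error` scorer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_rmse(model, X, y, k=10):
    """Average RMSE of `model` over k-fold cross-validation
    (k = 10 in the text). Lower is better."""
    scores = cross_val_score(model, X, y, cv=k,
                             scoring="neg_root_mean_squared_error")
    # scikit-learn returns negated errors so that higher is better;
    # flip the sign back to report a plain RMSE
    return float(-scores.mean())
```

Calling `cv_rmse` for each candidate model and averaging over hosts yields one bar per algorithm, as in Figure 3.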

4 LEARNING WITH EXTREME GRADIENT BOOSTING

Boosting is an ensemble-based machine learning method which builds strong learners from weak learners. Gradient boosting is an ensemble of weak learners, usually decision trees. XGBoost (eXtreme Gradient Boosting) is a scalable, fast, and efficient gradient boosting variant for tree boosting proposed by Chen et al. [21]. It incorporates many advanced techniques to increase performance, such as parallelism in tree creation, cache optimization with better data structures, and out-of-core computation using block compression and block sharding techniques, which is crucial when training on large data sets that would otherwise cause regular memory overflow. Accordingly, the impact of XGBoost is proven by its dominant adoption in many Kaggle competitions and also in large-scale production systems.

The XGBoost algorithm is an ensemble of K Classification and Regression Trees (CART) [21] and can be used for both classification and regression. The model is trained using an additive strategy. For a dataset with n instances and m features, the ensemble model uses K additive functions to estimate the output. Here, x is the set of input features, x = {x1, x2, ..., xm}, and y is the target prediction variable.

\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}    (2)

In Equation 2, \mathcal{F} is the space of all regression trees, i.e., \mathcal{F} = \{f(x) = w_{q(x)}\} (q : \mathbb{R}^m \rightarrow T, w \in \mathbb{R}^T). Here, q is the structure of each tree, which maps an instance to the corresponding leaf index, and T represents the total number of leaves in the tree. Each f_k represents an independent tree with structure q and leaf weights w. To learn the set of functions used in the model, XGBoost minimizes the following regularized objective.

\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \quad \text{where } \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2    (3)

In Equation 3, the first term l is a differentiable convex loss function that calculates the difference between the predicted value \hat{y}_i and the observed value y_i, while Ω penalizes the complexity of the model. The second part of Equation 3 is the regularization term and controls overfitting, where T is the number of leaves in the tree and w holds the values assigned to each leaf node. This regularized objective instinctively favors models built from simple and predictive functions.

Fig. 4: Comparison of prediction and theoretical model. (a) Temperature estimation compared to actual values; (b) Rank order of prediction errors.

We use the grid search technique to find optimal parameters that further enhance the performance of the model. The optimal value of gamma is 0.5, the learning rate is 0.1, the maximum depth of the tree is 4, the minimum child weight is 4, the subsample ratio is 1, and the rest of the parameters are set to their defaults. Here, the gamma parameter decides the minimum loss reduction required to make a further partition on a leaf node of the tree. The subsample ratio decides the fraction of the training data sampled to grow the trees. With these settings, the best RMSE value achieved is 0.05. It is important to note that prediction-based temperature estimation is feasible for any data center given historical data collected from that data center, and we expect this way of building prediction models to outperform theoretical models, especially in very large data centers.
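The grid search described above can be sketched as follows. This is a minimal illustration using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost (XGBRegressor exposes the same fit/predict interface and accepts the gamma and min_child_weight parameters named above); X and y here are synthetic placeholders for the host feature matrix and CPU temperatures, not the paper's data:

```python
# Minimal sketch of hyper-parameter grid search for a gradient
# boosting temperature model. Data and grid values are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 12))                        # placeholder host features
y = 40 + 10 * X[:, 0] + rng.normal(0, 1, 200)    # placeholder temperatures

param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 4],
    "subsample": [0.8, 1.0],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,   # the paper uses 10-fold cross-validation; 3 keeps this sketch fast
)
search.fit(X, y)
print(search.best_params_)
```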

5 EVALUATING THE PREDICTION MODEL WITH THE THEORETICAL MODEL

To evaluate the feasibility of our temperature prediction models, we compare the prediction results to an extensively used theoretical analytical model [9], [7], [8]. Here, the temperature estimation is based on the RC model, which is formulated from analytical methods. The temperature of a host (T) is calculated with the following equation.

T = PR + T_{in} + (T_{initial} - PR - T_{in}) \times e^{-\frac{t}{RC}}    (4)

In Equation 4, P is the dynamic power of the host, and R and C are the thermal resistance (K/W) and heat capacity (J/K) of the host, respectively. T_initial is the initial temperature of the CPU. Since the analytical model estimates CPU temperature, we also predict CPU temperature to compare the results, instead of ambient temperature.

To compare the results, we randomly select 1000 instances from our whole dataset and analyze the results of the prediction and theoretical models. For the theoretical model, the values of P and T_in are taken directly from our test data set. The thermal resistance (R) and heat capacity (C) are set to 0.34 K/W and 340 J/K respectively, and T_initial is set to 318 K [9].
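Equation 4 with these parameter values can be evaluated directly; a small sketch (the P and T_in inputs below are illustrative, not values from the test data set):

```python
import math

# RC thermal model (Equation 4) with the comparison's parameter values:
# R = 0.34 K/W, C = 340 J/K, T_initial = 318 K.
def rc_temperature(P, T_in, t, R=0.34, C=340.0, T_initial=318.0):
    """Estimate CPU temperature (Kelvin) after t seconds."""
    steady = P * R + T_in          # steady-state temperature P*R + T_in
    return steady + (T_initial - steady) * math.exp(-t / (R * C))

# At t = 0 the model returns T_initial; as t grows it approaches steady state.
print(rc_temperature(P=150.0, T_in=300.0, t=10_000))  # -> ~351.0 K
```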

The performance of the two models in temperature estimation can be observed in Figure 4. For the sake of visibility, Figure 4a includes 100 instances of data. As the figure suggests, the estimates of our proposed XGBoost-based model are very close to the actual values, whereas the theoretical model deviates considerably from the actual values. Figure 4b shows a rank order of the absolute errors (from the actual temperature) of the two models in °C. The theoretical model deviates by as much as 25 °C from the actual values. In this test, the average error of the theoretical model is 9.33 °C, while that of our prediction model is 2.38 °C. These results reflect the feasibility of using prediction models over theoretical models for temperature estimation. It is important to note that the prediction models need to be trained separately for different data centers with well-calibrated data. Nevertheless, in the absence of such a facility, it is still feasible to use theoretical analytical models, which rely on a minimal number of simple parameters.

6 DYNAMIC SCHEDULING GUIDED BY PREDICTION MODELS

Applications of temperature prediction are numerous. It can be used to change cooling settings, such as the supply air temperature, to save cooling cost [22]. It is also useful in identifying thermal anomalies, which increase the risk of failures and introduce performance bottlenecks. Moreover, one foremost use is in the tasks of a data center resource management system, such as resource provisioning and scheduling.

With the given historical host data, predictive models are trained and deployed for runtime inference. A scheduling algorithm invokes the deployed prediction model to accurately predict the host temperature. The input to the prediction model is a set of host features; in our model, these features can be easily collected from the host's onboard sensors. The features are accessed from the host's system interface through HTTP APIs. The complexity of retrieving this input feature set is O(1), and the latency of this operation depends on the data center's local network capabilities. Moreover, the models need to be retrained only when changes are introduced to the data center environment, such as the addition of new hosts or a change in the physical location of hosts. Considering that such changes are infrequent in a data center, the cost of building and using such predictive models in resource management tasks like scheduling is highly feasible.

In this regard, we propose dynamic scheduling of VMs in a cloud data center based on the temperature prediction model we have proposed. Here, we intend to reduce the peak temperature of the system while consolidating VMs on as few hosts as possible in each scheduling interval, which is a preferred way to reduce energy in a cloud data center [25]. In this problem, for n physical hosts in a data center hosting m VMs at timestep t, the objective is to reduce the number of active hosts at t + 1 by consolidating the VMs based on their workload level. This consolidation process is extremely important and is carried out regularly to reduce overall data


center energy [26], [27]. This procedure mainly includes three steps. First, identifying underloaded hosts from which we can potentially migrate all VMs and shut down the machine, and also finding overloaded hosts and migrating VMs away from them to reduce the risk of Service Level Agreement (SLA) violations; here, the SLA is providing the requested resources to VMs without degrading their performance. Second, selecting VMs for migration from the over- and underloaded hosts identified in the previous step. Finally, identifying new target hosts on which to schedule the selected VMs. The consolidation process gives hosts the opportunity to experience high load and potentially reach the threshold temperature, which is useful for evaluating our prediction models effectively. Therefore, the objective of our problem is defined as follows:

\text{minimize } T^{peak} = \sum_{t=0}^{T} \sum_{j=1}^{m} \sum_{i=1}^{n} \delta^{t}_{ji} T^{t}_{i}

\text{subject to } u(h_i) \leq U_{max}, \quad T^{t}_{i} < T_{red},

\sum_{j=0}^{m} VM_{ji}(R_{cpu}, R_{mem}) \leq h_i(R_{cpu}, R_{mem}),

\delta^{t}_{ji} \in \{0, 1\}, \quad \sum_{i=1}^{n} \delta^{t}_{ji} = 1    (5)

The objective function in Equation 5 minimizes the peak temperature of the hosts while scheduling VMs dynamically over all time steps t = 0, ..., T. Here, the VMs to be scheduled are indexed by j, where j = 1, ..., m, and the candidate hosts by i, where i = 1, ..., n. T^t_i indicates the temperature of host i at time t. The constraints ensure that the potential thermal and CPU thresholds are not violated due to increased workload allocation. They also enforce the capacity constraints, i.e., a host is considered suitable only if enough resources are available for the VM (R_cpu, R_mem). Here, \delta^t_{ji} is a binary variable with value 1 if VM_j is allocated to host_i at time interval t, and 0 otherwise. The summation of \delta^t_{ji} over i equals 1, indicating that VM_j is allocated to at most one host at time t. The objective function in Equation 5 is evaluated at each scheduling interval to decide the target hosts for the VMs to be migrated. Finding an optimal solution for the above equation is an NP-hard problem, which is infeasible for online dynamic scheduling. Accordingly, to achieve the stated objective and provide a near-optimal approximate solution within a reasonable amount of time, we propose a simple Thermal-Aware heuristic Scheduling (TAS) algorithm that minimizes the peak temperature of data center hosts.

To dynamically consolidate the workloads (VMs) based on their current usage level, our proposed greedy heuristic scheduling algorithm (Algorithm 1) is executed at every scheduling interval. The input to the algorithm is the list of VMs that need to be scheduled; these are identified based on overload and underload conditions. To identify overloaded hosts, we use the CPU (U_max) and temperature (T_red) thresholds together. In addition, if all the VMs of a host can be migrated to the currently active hosts, the host is considered underloaded. The VMs to be migrated from overloaded hosts are selected based on their minimum migration time, which is the ratio of their memory usage to the available bandwidth [25]. The output is a set of scheduling maps representing the target hosts for those VMs. For each VM to be migrated (line 2), the algorithm tries to allocate a new target host from the active list. In this process, it initializes the necessary objects (lines 3-5) and invokes the prediction model to predict the accurate temperature of each host (line 7). The VM is allocated to the host with the lowest temperature among the active hosts (lines 8-11). This ensures the reduction of peak temperature in the data center and avoids potential hotspots, resulting in lower cooling cost. Moreover, the algorithm ensures the constraints listed in Equation 5 are met (line 10), so that the added workload does not create a potential hotspot by violating the threshold temperature (T_red). In addition, the resource requirements of the VM (VM(R_x)) must be satisfied, and the CPU utilization must remain within the threshold (U_max). If no suitable host is found in this process, a new idle or inactive host is allocated from the available resource pool (line 16).

Algorithm 1 Thermal Aware Dynamic Scheduling to Minimize Peak Temperature
Input: VMList - List of VMs to be scheduled
Output: Scheduling Maps

1:  for t ← 0 to T do
2:    for all vm in VMList do
3:      allocatedHost ← ∅
4:      hostList ← Get list of active hosts
5:      minTemperature ← maxValue
6:      for all host in hostList do
7:        Ti ← Predict temperature by invoking prediction model
8:        if (Ti < minTemperature) then
9:          minTemperature ← Ti
10:         if (Ti < Tred and u(hi) ≤ Umax and vm(Rx) < host(Rx)) then
11:           allocatedHost ← host
12:         end if
13:       end if
14:     end for
15:     if allocatedHost == ∅ then
16:       allocatedHost ← Get a new host from inactive hosts list
17:     end if
18:   end for
19: end for

Algorithm 1 has a worst-case complexity of O(|V||N|), which is polynomial. Here, |V| is the number of VMs to be migrated and |N| is the number of hosts in the data center.
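The core loop of Algorithm 1 can be sketched compactly in Python. This is a minimal stand-in for the CloudSim implementation: Host, VM, and the predict_temperature callable are simplified illustrative stand-ins for the scheduler's data structures and the deployed per-host model:

```python
# Sketch of Algorithm 1 (Thermal-Aware Scheduling). Thresholds follow
# Section 7.1; data structures are illustrative.
from dataclasses import dataclass

T_RED = 105.0   # temperature threshold (degrees C)
U_MAX = 0.9     # CPU utilization threshold

@dataclass
class Host:
    name: str
    utilization: float
    free_cpu: int
    free_mem: int
    active: bool = True

@dataclass
class VM:
    name: str
    req_cpu: int
    req_mem: int

def schedule(vms, hosts, predict_temperature):
    """Return a {vm.name: host.name} scheduling map (lines 1-19)."""
    placement = {}
    for vm in vms:
        allocated = None
        min_temp = float("inf")
        for host in (h for h in hosts if h.active):
            t = predict_temperature(host)          # line 7: invoke model
            if t < min_temp:                       # lines 8-9
                min_temp = t
                if (t < T_RED and host.utilization <= U_MAX
                        and vm.req_cpu <= host.free_cpu
                        and vm.req_mem <= host.free_mem):  # line 10
                    allocated = host               # line 11
        if allocated is None:                      # lines 15-16: wake a host
            allocated = next(h for h in hosts if not h.active)
            allocated.active = True
        placement[vm.name] = allocated.name
    return placement

# Toy run: the VM lands on the cooler feasible host h2.
hosts = [Host("h1", 0.5, 8, 32), Host("h2", 0.3, 8, 32),
         Host("h3", 0.0, 8, 32, active=False)]
temps = {"h1": 80.0, "h2": 65.0, "h3": 50.0}
print(schedule([VM("vm1", 4, 16)], hosts, lambda h: temps[h.name]))
# -> {'vm1': 'h2'}
```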

7 PERFORMANCE EVALUATION

In this section, we evaluate the performance of the proposedalgorithm coupled with our prediction model and compareand analyze the results with baseline algorithms.


7.1 Experimental Setup

We evaluated the proposed thermal-aware dynamic scheduling algorithm through the CloudSim toolkit [28]. We extended CloudSim to incorporate the thermal elements and implement Algorithm 1. We used the real-world Bitbrains data set [29], which contains traces of resource consumption metrics of business-critical workloads hosted on Bitbrains' infrastructure. This data includes logs of over 1000 VMs hosted on two types of machines. We chose this data set because it represents real-world cloud infrastructure usage patterns, and because its metrics are similar to the features we collected in our own data set (Table 1), which is useful for constructing precise input vectors for the prediction models.

The total experiment period is set to 24 hours and the scheduling interval to 10 minutes, which matches the interval of the collected data. Note that the algorithm invokes the prediction models in several places: prediction is required to identify the host with the lowest temperature, to determine a host's overload condition, and to ensure the thermal constraints by predicting the temperature at the next time step.

To match the experiments to a real-world setting, we model host configurations similar to the hosts in our data center, i.e., DELL C6320 machines. This machine has an Intel Xeon E5-2600 processor with 64 cores (dual CPU) and 512 GB of primary memory. The VMs are configured based on the VM types offered by our research cloud 5; we chose 4 general flavor types. The number of hosts in the data center configuration is 75, matching the number of hosts in our private cloud's collected data, and the number of VMs is set to 750, which is the maximum possible on these hosts based on the VMs' maximum resource requirements. The workload for these VMs is generated according to the Bitbrains dataset traces of the respective VMs.

The CPU threshold (Umax) is set to 0.9. According to the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) [4] guidelines, the safe operable temperature threshold for data center hosts is between 95 and 105 °C. This threshold is a combined value of CPU temperature and inlet temperature. Accordingly, we set the temperature threshold (Tred) to 105 °C.

New target machines for the VMs to be scheduled are found using Algorithm 1, which requires predicting the temperature of hosts in the data center. If host_i's temperature (T_i) is to be predicted at the beginning of timestep t+1, the input to the prediction model is a single vector consisting of a set of features (CPU, Pc, fs1-fs4, NCPU, NCPUx, R, Rx, NRx, NTx, Nvm) representing its resource and usage metrics along with the power and fan speed measurements. The resource usage metrics are easily derived from the host's utilization level based on the workload of its currently hosted VMs. To estimate the power Pc, we use the SPECpower benchmark [30], which provides accurate power consumption (in watts) for our modeled host (DELL C6320) based on CPU utilization. To simplify the problem, we estimate the fan speeds by simple regression over the remaining features.
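The utilization-to-power lookup can be sketched as an interpolation over SPECpower-style measurement points at 10% utilization steps. The wattage values below are illustrative placeholders, not actual DELL C6320 measurements:

```python
# Sketch of power estimation from CPU utilization by linear
# interpolation between benchmark-style measurement points.
import numpy as np

util_points = np.linspace(0.0, 1.0, 11)          # 0%, 10%, ..., 100%
watts = np.array([45, 88, 101, 113, 125, 138,
                  152, 168, 185, 203, 220.0])    # hypothetical readings

def estimate_power(utilization):
    """Linearly interpolate host power (W) from CPU utilization."""
    return float(np.interp(utilization, util_points, watts))

print(estimate_power(0.55))  # midway between the 50% and 60% points
```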

Fig. 5: Average temperature in each scheduling interval (total experiment time of 24 hours, with a scheduling interval of 10 minutes)

5. https://tinyurl.com/y5zxe8em

We export the trained models as serialized Python objects and expose them to our scheduling algorithm by hosting them in an HTTP Flask application 6. The CloudSim scheduling entities invoke the prediction model through REST APIs, passing the feature vector and host ID as input; the HTTP application returns the predicted temperature for the associated host.
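Such a Flask endpoint can be sketched as below. The route name and payload keys are illustrative assumptions, not the paper's actual API; the MODELS dictionary stands in for the deserialized per-host models:

```python
# Minimal sketch of a Flask service returning a predicted host
# temperature for a posted feature vector.
from flask import Flask, request, jsonify

app = Flask(__name__)
MODELS = {}   # host_id -> deserialized model, loaded at startup

@app.route("/predict", methods=["POST"])
def predict():
    body = request.get_json()
    model = MODELS[body["host_id"]]
    # scikit-learn/XGBoost style interface: predict on a 2-D batch
    temperature = model.predict([body["features"]])[0]
    return jsonify({"host_id": body["host_id"],
                    "temperature": float(temperature)})

# In production this would be served with app.run() or a WSGI server.
```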

7.2 Analysis of Results

We compare the results with two baseline algorithms as below.

• Round Robin (RR) - This algorithm tries to distribute the workload equally among all hosts by placing VMs on hosts in a circular fashion. The same constraints as in Algorithm 1 are applied. We show that the notion of equal distribution of workload fails to minimize the peak temperature and thermal variations in a data center.

• GRANITE - This is a thermal-aware VM scheduling algorithm proposed in [31] that minimizes computing and cooling energy holistically. We chose this particular algorithm because, like ours, it addresses the thermal-aware dynamic VM scheduling problem.

We use our prediction models to estimate the temperature in both the RR and GRANITE algorithms. For GRANITE, the required parameters are set as in their algorithm [31], including the overload and underload detection methods. The comparison of the average temperature over all hosts in each scheduling interval for the three algorithms is shown in Figure 5. Our Thermal-Aware Scheduling (TAS) has the lowest average temperature compared to RR and GRANITE. The RR algorithm's equal workload distribution policy results in less variation in average temperature. However, this does not help to reduce the peak temperature in the data center, irrespective of its intuitively fair distribution behavior, as it does not consider the thermal behavior of individual hosts and its decisions are completely thermal-agnostic. The GRANITE

6. http://flask.pocoo.org


Fig. 6: Temperature distribution analysis due to scheduling (aggregated from all hosts in the experimented period): (a) TAS, (b) RR, (c) GRANITE.

Fig. 7: CDF between TAS, RR, and GRANITE (TAS: µ = 73.65, σ = 6.82; RR: µ = 80.69, σ = 10.49; GRANITE: µ = 77.36, σ = 9.34)

policy has a high average temperature and large variations between scheduling intervals due to its inherent dynamic threshold policies. To further analyze the distribution of temperature under the scheduling approaches, we draw a histogram with Kernel Density Estimation (KDE) using temperature data collected from all the hosts in each scheduling interval, as shown in Figure 6. Most of the hosts in the data center operate around 70 to 80 °C under TAS (Figure 6a), well below the threshold, due to its peak temperature minimizing objective. However, the RR approach results in more thermal variation with sustained high temperatures (Figure 6b). GRANITE also has a significant share of its distribution around the peak temperature (Figure 6c). This temperature distribution is effectively summarized by the Cumulative Distribution Function (CDF) of the three approaches (Figure 7). As we can see in Figure 7, TAS reaches a cumulative probability of 1 well below 100 °C, indicating that most of the hosts operate at reduced temperatures. RR and GRANITE reach peak temperatures of more than 100 °C with high cumulative probability. In addition, as depicted in Figure 7, the mean and standard deviation of temperature under TAS (µ = 75.65, σ = 6.82) are lower than under the other two approaches (µ = 80.69, σ = 10.49 for RR and µ = 77.36, σ = 9.34 for GRANITE); this is also evidenced by

TABLE 4: Scheduling results compared with the RR and GRANITE algorithms

Algorithm   Peak Temperature (°C)   Total Energy (kWh)   Active Hosts
TAS         95                      172.20               4
RR          101.44                  391.57               18
GRANITE     101.81                  263.20               11

Figure 5.

Further results of the experiments are presented in Table 4. The total energy consumption of TAS, RR, and GRANITE is 172.20 kWh, 391.57 kWh, and 263.20 kWh, respectively (the total energy is the combination of cooling and computing energy, calculated as in [10]). Therefore, RR and GRANITE consume 56% and 34.5% more energy than TAS, respectively. This is because RR and GRANITE distribute the workload across more hosts, resulting in a higher number of active hosts. In the experimented period, RR and GRANITE had 18 and 11 active hosts on average, while the TAS algorithm resulted in 4 active hosts. Furthermore, although RR distributes the workload among many hosts, its thermal-agnostic nature yielded a peak temperature of 101.44 °C, and GRANITE a peak temperature of 101.81 °C, while TAS attained a maximum of 95.5 °C during the experimentation period, which is 6.5 °C less than the other approaches. This demonstrates that accurate prediction of host temperature combined with an effective scheduling strategy can reduce the peak temperature and save a significant amount of energy in the data center.

7.3 Dealing with False Predictions

In our scheduling experiments, we observed that a few of the temperature predictions resulted in large values beyond the boundaries of the expected range. A closer study of such cases revealed that this happens with three particular hosts that were almost idle during the three-month data collection period, with a CPU load of less than 1%, which means the models trained for these hosts saw limited variation in their feature set. As a trained model had no instance close to the instance being predicted, the prediction produced an extreme value. Such a false prediction at runtime results in an incorrect scheduling decision that affects the normal behavior of the system. In this regard, the scheduling process should account for such adverse edge cases. To tackle


Fig. 8: Feature analysis. (a) Feature importance (weight): the number of times a particular feature occurs in the trees (Pc: 153, fs1: 107, Nvm: 87, Rx: 81, Ncpux: 60, CPU: 50, R: 47, fs4: 40, fs2: 35, fs3: 27, Nrx: 8, Ntx: 5). (b) Feature threshold analysis (RMSE vs. number of features).

this problem, we set minimum and maximum bounds for the expected prediction value based on our observations of the dataset; for any prediction beyond these boundaries, we pass the input vector to the models of all remaining hosts and take the average of the predicted values as the final prediction. In this way, we avoid the bias induced by a particular host and still obtain a reasonably good prediction. In the case of a very large number of hosts, a subset of the hosts can be used for this purpose.
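This fallback can be sketched as below. The bound values are assumptions for illustration, and the model objects are any callables standing in for the serialized per-host models:

```python
# Sketch of the out-of-bounds fallback: if a host's own model predicts
# outside the plausible range, average the other hosts' predictions.
T_MIN, T_MAX = 20.0, 110.0   # assumed plausible CPU temperature bounds

def robust_predict(host_id, features, models):
    """models: dict mapping host id -> prediction callable."""
    t = models[host_id](features)
    if T_MIN <= t <= T_MAX:
        return t
    # Out of bounds: fall back to the mean over the remaining hosts.
    others = [m(features) for h, m in models.items() if h != host_id]
    return sum(others) / len(others)

# Toy example: host "h1"'s model misbehaves on an unseen input.
models = {"h1": lambda f: 500.0, "h2": lambda f: 72.0, "h3": lambda f: 76.0}
print(robust_predict("h1", [0.4, 120.0], models))  # -> 74.0
```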

This also shows that, in order to use the prediction models effectively, the training dataset should have a good distribution of values for all the hosts. Deploying such models in a real-world data center requires proper data preprocessing to handle such inconsistencies so that the ML models perform better.

8 FEATURE SET ANALYSIS

We carried out a feature analysis to identify the importance of each feature to the model's performance. This analysis can also be used in the feature selection process to remove redundant features, reduce computational cost, and increase performance. Figure 8a shows the importance of each feature in the constructed XGBoost model. Here, the weight associated with each feature corresponds to the number of times it occurs in the constructed trees, which indirectly indicates its importance. Based on the results, host power (Pc), fan speed 1 (fs1), and the number of VMs (Nvm) are the most important features for accurate prediction. It is worth noting that, although we have 4 fan speeds, the model intuitively selects one fan speed with higher weight; this is because all four fans operate at almost the same RPM, as observed in our data set. The least important features are the network metrics (Nrx, Ntx) along with the remaining three fan speed readings. A crucial observation is that the model gives high importance to power rather than CPU load, indicating the high correlation between temperature and power. The number of cores (NC) is not included in the trees, as it has a constant value across hosts and thus introduces no variation into the data.

The performance of temperature prediction with different feature thresholds can be observed in Figure 8b. We start with the most important feature and recursively add more features in order of importance. The y-axis indicates the RMSE value and the x-axis the number of features. The first three features (Pc, fs1, Nvm) contribute significantly to prediction accuracy, and the accuracy gain is small as we add more features to the model. Therefore, based on the required accuracy or RMSE value, we can select the top n features to train the model effectively with less complexity.
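The threshold analysis above can be sketched as follows: rank features by importance, then train on the top-n features and record the cross-validated RMSE for each n. scikit-learn's GradientBoostingRegressor serves as a stand-in for XGBoost, and X, y are synthetic placeholders, not the paper's data:

```python
# Sketch of the feature-threshold analysis of Figure 8b.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.random((300, 6))                              # 6 synthetic features
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 300)

# Rank features by importance from a model trained on all features.
full = GradientBoostingRegressor(random_state=0).fit(X, y)
ranked = np.argsort(full.feature_importances_)[::-1]

rmse_by_n = {}
for n in range(1, X.shape[1] + 1):
    cols = ranked[:n]
    scores = cross_val_score(
        GradientBoostingRegressor(random_state=0), X[:, cols], y,
        scoring="neg_root_mean_squared_error", cv=3)
    rmse_by_n[n] = -scores.mean()

# RMSE typically plateaus once the informative features are included.
print(rmse_by_n)
```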

9 RELATED WORK

Thermal management using theoretical analytical models has been studied by many researchers in the recent past [22], [32], [8], [7]. These models, based on mathematical relationships for estimating the temperature, are not accurate enough when compared to actual values. Moreover, [32] and [7] use analytical models and target HPC systems where jobs have specific completion times, while our work targets virtualized cloud data centers with long-running applications that need dynamic scheduling and migration in real time. Furthermore, some studies have explored CFD models [11]. Computational Fluid Dynamics (CFD) models provide accurate thermal measurements; however, their massive computational demand hinders their adoption in real-time online tasks such as scheduling. Researchers are actively exploring data-driven ML algorithms to optimize the efficiency of computing systems [19], [15]. With the help of ML techniques, Google data centers are able to reduce their cooling cost by up to 40% [17].

Thermal and energy management inside the data center using machine learning techniques has been studied by many researchers in recent years. Such techniques have largely been applied to find optimal settings or configurations of systems to achieve energy efficiency [33]. However, ML techniques specific to temperature prediction were studied by Zhang et al. [34], who proposed a Gaussian


process based host temperature prediction model for HPC data centers. They used a two-node Intel Xeon Phi cluster to run HPC test applications and collect the training data. In addition, they proposed a greedy algorithm for application placement to minimize thermal variations across the system. In extended work [6], they enhanced their solution to include more efficient models such as lasso linear regression and the Multilayer Perceptron (MLP). The results have shown that predictive models are accurate and perform well in data center resource management. Imes et al. [33] explored different ML classifiers to configure hardware counters to achieve energy efficiency for a given application. They tested 15 different classifiers, including Support Vector Machine (SVM), K-Nearest Neighbours (KNN), and Random Forest (RF). This work only considers energy as an optimization metric, ignoring the thermal aspect. Moreover, these works are specific to HPC data centers, where temperature estimation is application-specific and requires access to application counters. Our proposed solution, in contrast, targets infrastructure clouds, where such an approach is not feasible due to the limited access to application counters enforced by the isolated virtualized environment. Thus, we rely on features that completely bypass application counters, considering only host-level resource usage and hardware counters, and yet achieve high prediction accuracy.

Furthermore, Ignacio et al. [35] demonstrated a thermal anomaly detection technique using Artificial Neural Networks (ANNs). They specifically use Self-Organising Maps (SOM) to detect abnormal behavior in the data center relative to previously trained reliable performance. They evaluated their solution using traces of anomalies from a real data center. Moore et al. [13] proposed Weatherman, a predictive thermal mapping framework for data centers. They studied the effect of workload distribution on cooling settings and temperature in the data center. These models are designed to find thermal anomalies and manage the workload at the data center level, without attention to accurate temperature prediction.

In addition to thermal management, many others have applied ML techniques to scheduling in distributed systems to optimize parameters such as energy, performance, and cost. Among the existing ML approaches, Reinforcement Learning (RL) is widely used for this purpose [36], [18], [37]. Orheab et al. [36] studied the RL approach for scheduling in distributed systems. They used the Q-learning algorithm to train a model that learns optimal scheduling configurations, and proposed a platform that provides scheduling as a service for better execution time and efficiency. Cheng et al. proposed DRLcloud, which provides an RL framework for provisioning and task scheduling in the cloud to increase energy efficiency and reduce task execution time. Similarly, the authors of [37] studied deep RL based resource management in distributed systems. Learning to schedule is prominent among RL-based methods because RL models keep improving at runtime [38], which is convenient for scheduling. However, our work differs from these in that the primary objective of our problem is to estimate the data center host temperature accurately to facilitate resource management tasks. In this regard, our work is complementary to these solutions: such thermal prediction models can be adopted by these ML-based scheduling frameworks to further enhance their efficiency.

10 CONCLUSIONS AND FUTURE WORK

Estimating the temperature in a data center is a complex and non-trivial problem. Existing approaches for temperature prediction are inaccurate and computationally expensive. Optimal thermal management with accurate temperature prediction can reduce the operational cost of a data center and increase reliability. Data-driven temperature estimation of hosts in a data center can give us more accurate predictions than simple mathematical models, as we were able to take into account CPU and inlet airflow temperature variations through measurements. Our study, based on physical host-level data collected from our University's private cloud, has shown that large thermal variation is present between hosts, in both CPU and inlet temperature. To accurately predict the host temperature, we explored several machine learning algorithms and found the gradient boosting based XGBoost model to be the best for temperature prediction. Our extensive empirical evaluation achieved high prediction accuracy, with an average RMSE value of 0.05; in other words, our prediction model has an average error of 2.38 °C. Compared to an existing theoretical model, it reduces the prediction error by almost 7 °C.

Guided by these prediction models, we proposed a dynamic scheduling algorithm for cloud workloads to minimize the peak temperature. The proposed algorithm is able to save up to 34.5% more energy and reduce the peak temperature by up to 6.5 °C compared to the best baseline algorithm. It is important to note that, although the models built for one data center are optimized for that center alone (as each data center's physical environment and parameters vary greatly), the methodology presented in this work is generic and can be applied to any cloud data center given a sufficient amount of data collected from the respective data center.

In the future, we plan to explore more sophisticated models to achieve better accuracy and performance. We also intend to extend the work to heterogeneous nodes such as GPUs and FPGAs. Another interesting direction is to consider weather parameters and forecasts and their effect on cooling and on scheduling long-running jobs.

ACKNOWLEDGMENT

We thank Bernard Meade and Justin Mammarella at Research Platform Services, The University of Melbourne, for their support and for providing access to the infrastructure cloud and data.

REFERENCES

[1] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, “Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility,” Future Generation Computer Systems, vol. 25, no. 6, pp. 599–616, 2009.

[2] A. Shehabi, S. Smith, D. Sartor, R. Brown, M. Herrlin, J. Koomey, E. Masanet, N. Horner, I. Azevedo, and W. Lintner, “United States data center energy usage report,” 2016.


[3] C. D. Patel, C. E. Bash, and A. H. Beitelmal, “Smart cooling of data centers,” Jun. 3 2003, US Patent 6,574,104.

[4] ASHRAE, “American Society of Heating, Refrigerating and Air-Conditioning Engineers,” 2018. [Online]. Available: http://tc0909.ashraetcs.org/

[5] U. plc, “A guide to ensuring your UPS batteries do not fail from UPS Systems,” 2018. [Online]. Available: http://www.upssystems.co.uk/knowledge-base/the-it-professionals-guide-to-standby-power/part-8-how-to-ensure-your-batteries-dont-fail/

[6] K. Zhang, A. Guliani, S. Ogrenci-Memik, G. Memik, K. Yoshii, R. Sankaran, and P. Beckman, “Machine learning-based temperature prediction for runtime thermal management across system components,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 2, pp. 405–419, Feb 2018.

[7] Q. Tang, S. K. S. Gupta, and G. Varsamopoulos, “Energy-efficient thermal-aware task scheduling for homogeneous high-performance computing data centers: A cyber-physical approach,” IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 11, pp. 1458–1472, 2008.

[8] H. Sun, P. Stolf, and J.-M. Pierson, “Spatio-temporal thermal-aware scheduling for homogeneous high-performance computing datacenters,” Future Generation Computer Systems, vol. 71, pp. 157–170, 2017.

[9] S. Zhang and K. S. Chatha, “Approximation algorithm for the temperature aware scheduling problem,” in Proceedings of the International Conference on Computer-Aided Design, 2007, pp. 281–288.

[10] S. Ilager, K. Ramamohanarao, and R. Buyya, “ETAS: Energy and thermal-aware dynamic virtual machine consolidation in cloud data center with proactive hotspot mitigation,” Concurrency and Computation: Practice and Experience, vol. 0, no. 0, p. e5221, 2019.

[11] J. Choi, Y. Kim, A. Sivasubramaniam, J. Srebric, Q. Wang, and J. Lee, “A CFD-based tool for studying temperature in rack-mounted servers,” IEEE Transactions on Computers, vol. 57, no. 8, pp. 1129–1142, 2008.

[12] A. Almoli, A. Thompson, N. Kapur, J. Summers, H. Thompson, and G. Hannah, “Computational fluid dynamic investigation of liquid rack cooling in data centres,” Applied Energy, vol. 89, no. 1, pp. 150–155, 2012.

[13] J. Moore, J. S. Chase, and P. Ranganathan, “Weatherman: Automated, online and predictive thermal mapping and management for data centers,” in Proceedings of the IEEE International Conference on Autonomic Computing, June 2006, pp. 155–164.

[14] M. Zapater, J. L. Risco-Martín, P. Arroba, J. L. Ayala, J. M. Moya, and R. Hermida, “Runtime data center temperature prediction using grammatical evolution techniques,” Applied Soft Computing, vol. 49, pp. 94–107, 2016.

[15] G. C. Fox, J. A. Glazier, J. C. S. Kadupitiya, V. Jadhao, M. Kim, J. Qiu, J. P. Sluka, E. T. Somogyi, M. Marathe, A. Adiga, J. Chen, O. Beckstein, and S. Jha, “Learning everywhere: Pervasive machine learning for effective high-performance computation,” CoRR, 2019. [Online]. Available: http://arxiv.org/abs/1902.10810

[16] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh, “Learning scheduling algorithms for data processing clusters,” CoRR, vol. abs/1810.01963, 2018. [Online]. Available: http://arxiv.org/abs/1810.01963

[17] J. Gao, “Machine learning applications for data center optimization,” 2014.

[18] M. Cheng, J. Li, and S. Nazarian, “DRL-Cloud: Deep reinforcement learning-based resource provisioning and task scheduling for cloud service providers,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), 2018, pp. 129–134.

[19] D. Jeff, “ML for system, system for ML, keynote talk in Workshop on ML for Systems, NIPS,” 2018. [Online]. Available: http://mlforsystems.org/

[20] Y. Luo, X. Wang, S. Ogrenci-Memik, G. Memik, K. Yoshii, and P. Beckman, “Minimizing thermal variation in heterogeneous HPC systems with FPGA nodes,” in Proceedings of the 2018 IEEE 36th International Conference on Computer Design (ICCD), Oct 2018, pp. 537–544.

[21] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 785–794.

[22] J. D. Moore, J. S. Chase, P. Ranganathan, and R. K. Sharma, “Making scheduling ”cool”: Temperature-aware workload placement in data centers,” in Proceedings of the USENIX Annual Technical Conference. USENIX Association, 2005, pp. 61–75.

[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.

[24] R. Caruana and A. Niculescu-Mizil, “An empirical comparison of supervised learning algorithms,” in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML ’06. New York, NY, USA: ACM, 2006, pp. 161–168.

[25] A. Beloglazov, J. Abawajy, and R. Buyya, “Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing,” Future Generation Computer Systems, vol. 28, no. 5, pp. 755–768, 2012.

[26] A. Verma, G. Dasgupta, T. K. Nayak, P. De, and R. Kothari, “Server workload analysis for power minimization using consolidation,” in Proceedings of the USENIX Annual Technical Conference. USENIX Association, 2009, pp. 28–28.

[27] M. Xu, A. V. Dastjerdi, and R. Buyya, “Energy efficient scheduling of cloud application components with brownout,” IEEE Transactions on Sustainable Computing, vol. 1, no. 2, pp. 40–53, July 2016.

[28] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. De Rose, and R. Buyya, “CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms,” Software: Practice and Experience, vol. 41, no. 1, pp. 23–50, 2011.

[29] S. Shen, V. van Beek, and A. Iosup, “Statistical characterization of business-critical workloads hosted in cloud datacenters,” in Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE, 2015, pp. 465–474.

[30] SPEC, “Standard Performance Evaluation Corporation,” 2018. [Online]. Available: https://www.spec.org/benchmarks.html

[31] X. Li, P. Garraghan, X. Jiang, Z. Wu, and J. Xu, “Holistic virtual machine scheduling in cloud datacenters towards minimizing total energy,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 6, pp. 1–1, 2017.

[32] T. Cao, W. Huang, Y. He, and M. Kondo, “Cooling-aware job scheduling and node allocation for overprovisioned HPC systems,” in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2017, pp. 728–737.

[33] C. Imes, S. Hofmeyr, and H. Hoffmann, “Energy-efficient application resource scheduling using machine learning classifiers,” in Proceedings of the 47th International Conference on Parallel Processing, ser. ICPP. New York, NY, USA: ACM, 2018, pp. 45:1–45:11.

[34] K. Zhang, S. O. Memik, G. Memik, K. Yoshii, R. Sankaran, and P. H. Beckman, “Minimizing thermal variation across system components,” in Proceedings of the International Parallel and Distributed Processing Symposium. IEEE Computer Society, 2015, pp. 1139–1148.

[35] I. Aransay, M. Z. Sancho, P. A. García, and J. M. M. Fernández, “Self-organizing maps for detecting abnormal thermal behavior in data centers,” in Proceedings of the 8th IEEE International Conference on Cloud Computing (CLOUD), 2015, pp. 138–145. [Online]. Available: http://oa.upm.es/42751/

[36] A. Iulian, F. Pop, and I. Raicu, “New scheduling approach using reinforcement learning for heterogeneous distributed systems,” Journal of Parallel and Distributed Computing, vol. 117, pp. 292–302, 2018.

[37] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “Resource management with deep reinforcement learning,” in Proceedings of the 15th ACM Workshop on Hot Topics in Networks, ser. HotNets ’16. New York, NY, USA: ACM, 2016, pp. 50–56.

[38] R. S. Sutton, A. G. Barto et al., Introduction to Reinforcement Learning. MIT Press Cambridge, 1998, vol. 135.


Shashikant Ilager is a PhD candidate with the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He received his Bachelor’s and Master’s degrees in Computer Science from VTU and the University of Hyderabad in 2013 and 2016, respectively. His research interests include distributed systems and cloud computing. He is currently working on resource management through data-driven predictive optimization techniques, focusing on the power and thermal aspects of large-scale data centers.

Kotagiri Ramamohanarao received the PhD degree from Monash University. He is currently a professor of computer science at the University of Melbourne. He served on the editorial board of the Computer Journal. At present, he is on the editorial boards of Universal Computer Science, Data Mining, and the International Very Large Data Bases Journal. He was the program co-chair for the VLDB, PAKDD, DASFAA, and DOOD conferences.

Rajkumar Buyya is a Redmond Barry Distinguished Professor and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also serving as the founding CEO of Manjrasoft, a spin-off company of the University, commercializing its innovations in cloud computing. He has authored over 625 publications and seven textbooks, including ”Mastering Cloud Computing” published by McGraw Hill, China Machine Press, and Morgan Kaufmann for the Indian, Chinese, and international markets, respectively. He is one of the most highly cited authors in computer science and software engineering worldwide (h-index=134, g-index=298, 95,800+ citations). He is a Fellow of the IEEE.
