Cluster Comput (2013) 16:559–573
DOI 10.1007/s10586-012-0224-9

Rack aware scheduling in HPC data centers: an energy conservation strategy

Vikas Ashok Patil · Vipin Chaudhary

Received: 5 January 2012 / Accepted: 27 June 2012 / Published online: 22 August 2012
© Springer Science+Business Media, LLC 2012

Abstract Energy consumption in high performance computing data centers has become a long standing issue. With rising costs of operating the data center, various techniques need to be employed to reduce the overall energy consumption. Currently, among others, there are techniques that guarantee reduced energy consumption by powering on/off the idle nodes. However, most of them do not consider the energy consumed by other components in a rack. Our study addresses this aspect of the data center. We show that we can gain considerable energy savings by reducing the energy consumed by these rack components. In this regard, we propose a scheduling technique that helps schedule jobs with the above mentioned goal. We claim that with our scheduling technique we can reduce the energy consumption considerably without affecting other performance metrics of a job. We implement this technique as an enhancement to the well-known Maui scheduler and present our results. We propose three different algorithms as part of this technique. The algorithms evaluate the various trade-offs that could possibly be made with respect to overall cluster performance. We compare our technique with various currently available Maui scheduler configurations. We simulate a wide variety of workloads from real cluster deployments using the simulation mode of Maui. Our results consistently show about 7 to 14 % savings over the currently available Maui scheduler configurations. We shall also see that our technique can be applied in tandem with most of the existing energy aware scheduling techniques to achieve enhanced energy savings.

We also consider the side effects of power losses due to the network switches as a result of deploying our technique. We compare our technique with the existing techniques in terms of the power losses due to these switches, based on the results in Sharma and Ranganathan, Lecture Notes in Computer Science, vol. 5550, 2009, and account for the power losses. We thereon provide a best fit scheme with rack considerations.

We then propose an enhanced technique that merges the two extremes of node allocation based on rack information. We see that we can provide a way to configure the scheduler based on the kind of workload that it schedules and reduce the effect of job splitting across multiple racks. We further discuss how the enhancement can be utilized to build a learning model which can be used to adaptively adjust the scheduling parameters based on the workload experienced.

Keywords Rack aware scheduling · Power conservation · Scheduling

V.A. Patil · V. Chaudhary
State University of New York, Buffalo, USA
e-mail: [email protected]

1 Introduction

The energy used by the current data centers is significant. The EPA report to the US Congress on “Server and Data Center Efficiency” [1] estimated a usage of 61 billion kilowatt hours (kWh) in 2006. The same report predicted its rise to 100 billion kWh in 2011. The carbon emission equivalent for this amount of energy consumption was about 846 million metric tons in 2006. The report highlights the magnitude of the energy use by the current data centers and the need for aggressive energy conservation strategies to be adopted by the operators of the data centers. Due to the ever increasing business and scientific computing needs, this problem is exacerbated and has resulted in a significant rise in operating costs.


Electricity cost is one of the major operating costs of a data center [2]. Reducing the energy consumption in a data center can significantly reduce the operating costs. This has sparked serious research interest in both academic and commercial research groups. As a result, there have been considerable improvements at the software, hardware and infrastructure levels of the data center ever since.

This view of reducing energy consumption to cut down the operating costs in a data center was deemed contradictory in High Performance Computing (HPC) data centers, as achieving improvements in performance has always been the key focus. However, even HPC data centers are plagued by the increasing costs due to energy consumption. The most power-consuming supercomputer runs at 6.95 megawatts (MW), and IBM's Roadrunner, No. 2 on the TOP500 list, consumes 2.48 MW [3]. This has sparked significant research interest to reduce the energy consumption of these specialized data centers.

Energy conservation by improvements in job scheduling techniques has been one such area of active research. Job scheduling is an important aspect of any HPC data center. The function of the job scheduler is to allocate the data center resources such as CPUs, storage and network to the incoming jobs and increase the overall cluster utilization. From an energy conservation point of view it has to do much more than that: it needs to allocate the resources in a way that reduces the overall energy consumed by the cluster without deviating much from the job turnaround times. A few current day job schedulers take this issue of power conservation into consideration and are being widely used across academic and commercial HPC installations [3]. These HPC installations use this feature of job schedulers to turn idle nodes on/off or even perform dynamic scaling of the power supply to the cluster components, thereby conserving considerable energy. Most of these job schedulers operate at node level granularity, which means the scheduler views the data center as a set of nodes (and sometimes also as the node's sub-components such as number of cores, memory size, etc.). Though scheduling at the node level is consistent with the resource demands of jobs, we consider it unsatisfactory from an energy conservation perspective.

In this work we show that we can achieve additional savings in power by considering job scheduling at the rack level granularity. A rack is an enclosure of nodes and often has certain additional components associated with it. These components are interconnect bays, fans, power distribution units (PDU), blowers, etc. We propose a set of enhancements to the scheduling algorithm for performing resource allocations at the rack level granularity. We also implement it as an enhancement to the existing Maui scheduler and test our proposal on real data center workloads.

The Maui scheduler is one of the widely used schedulers in HPC data centers [4]. There are multiple reasons for choosing the Maui scheduler over any other existing cluster scheduler. One reason is that Maui is an open source scheduler, which allows us to modify the scheduler source and test the rack awareness concept. Another is that Maui is a highly configurable scheduler: we can write plugins that modify the default behavior of the scheduler. We shall talk about the specific feature that we modify and how Maui seamlessly allows for such a modification. There are a few other reasons too that we discuss in the section describing the Maui scheduler.

We further enhance our initial approach to study the effects of the power losses due to job splitting for parallel jobs resulting from the deployment of our scheduling technique. We call this the best-rack-fit approach. We then compare the power savings with the first approach. We further merge the two approaches and propose a more elegant solution encompassing the first two approaches. This merged approach consists of a set of parameters which can be tweaked based on the nature of the workload at hand. A machine learning approach can eventually be applied which will allow one to automatically tweak the parameters.

The scheduling approaches explored are mainly targeted towards low utilization clusters. After gathering workload logs from the workload archives, we have come to the conclusion that a good number of clusters in the real world are low utilization clusters. Section 5 shows a graph of the average utilizations of some well-known clusters. By “low utilization” we mean clusters with an average utilization of about 40 to 50 %. Our technique is very well suited for such low utilization clusters and gives maximum benefit for them; the savings diminish as the cluster utilization increases. Even for clusters with an average utilization of about 90 % we observe energy savings of about 10 % without impacting the turnaround time substantially. Since a considerable number of clusters fall in the low utilization category, our technique should find wide use. Overall the study adds a new perspective to the existing scheduling strategies and provides an opportunity for existing power aware scheduling strategies to incorporate our technique to achieve enhanced savings.

The work also highlights certain statistics of the energy costs associated at the rack level. This data is gathered from an operational commercial HPC data center. The data is utilized to perform simulations using the simulation mode of the Maui scheduler. The statistics gathered from these simulations show considerable energy conservation.

Essentially, a typical rack consists of 2 to 3 enclosures. Each enclosure in turn consists of several nodes, which are typically blades. We use the terms rack and enclosure interchangeably in our discussion throughout, as there is not much of a difference in the application of our technique. We may also encounter the term frame, which essentially means rack; this term is found in the Maui scheduler configuration guide and its source code. We still mean rack in all these cases and our technique does not vary in any way.

The remainder of this report is organized as follows. Section 2 briefly describes the related work with regard to enhancements in job scheduling from an energy conservation point of view. In Sect. 3 we provide the current rack level power statistics. In Sect. 4 we introduce the Maui scheduler and its high level algorithm. In Sect. 5 we briefly explain the benefit of node allocation by considering the rack level granularity. In Sect. 6 we propose our algorithms and give details of their implementation with reference to the Maui scheduler. We discuss the issue of the network switches in Sect. 7. In Sect. 8 we describe the experimental setup and Sect. 9 describes the results obtained. We then conclude our work in Sect. 10 and also discuss some possible future enhancements.

2 Related work

Various techniques have been proposed for improved power management in HPC clusters. SLURM [5] is a widely used resource management and scheduling system for supercomputers. It has a power management facility to put the idle nodes in a lower power state. It has facilities to contain a surge in the workload and alters the node states gradually. However, not much research has been done to provide a better power management policy [2].

In [6], Pinheiro et al. proposed a load concentration policy, which turns cluster nodes on or off dynamically according to the workload imposed on the system. Chase et al. [7] take this further and propose Muse, which uses Service Level Agreements (SLAs) to adjust active resources by making a trade-off between service quality and power consumption. These works can be classified as dynamic cluster reconfigurations. Chen et al. [8] applied dynamic voltage frequency scaling techniques along with dynamic cluster reconfiguration to achieve further improvements.

Scheduling solutions based on the use of virtualization have also been proposed. The general idea is to utilize VM consolidation or intelligent node allocation and achieve overall energy conservation [2]. Verma et al. [9] investigated aspects of power management in HPC applications by determining the VM placements based on CPU, cache and memory footprints. Dhiman et al. [10] proposed a multi-tiered software system called vGreen, a solution which takes migration overhead into consideration. Nathuji and Schwan [11] suggested VirtualPower, which defines virtual power states for the servers based on their scheduling policies as well as the CPU frequency. They used power management hints provided by the guest OS to implement global or local power policies across the physical machines. In [23] Hermenier et al. propose a consolidation manager, Entropy, which uses constraint programming (CP) to solve the VM placement problem. The idea is to define the consolidation problem as a set of constraints and use a standard library such as Choco to solve the problem. Berral et al. [24] use a machine learning based approach for power modeling, where they predict the power requirements of a VM and make VM placement and migration decisions based on this prediction. Most of these server consolidation techniques focus on utilizing as few nodes as possible, but they do not concentrate on the packing of nodes based on actual node locations.

Topology aware scheduling considers the scheduling of jobs based not only on the properties of the requested machines but also on the data center properties. However, largely this has been to do with machine interconnections. Gabrielyan et al. [12] discuss one such strategy, but concentrate on inter-node collective communication aspects. There has also been significant work in thermal management, such as the thermal management system by Heath et al. [13] and a study of temperature aware workload placement by Moore et al. [14].

In [27], Ranganathan et al. propose an ensemble (enclosure) level power management scheme. However, they mainly focus on power capping at the enclosure level. A controller exists at the enclosure level which is responsible for ensuring that the total power consumption of the enclosure does not cross a predefined limit. This scheme does not consider a cluster wide view and is ineffective from a parallel job scheduling perspective.

Most of these cluster wide techniques consider the physical node level granularity; none of them view the data center from a rack perspective. We show that we can consider the rack level granularity and achieve additional energy savings. Our technique can be applied in tandem with most of these other techniques. Thereby we can incorporate our technique into existing data center installations and realize the additional power conservation benefits. Our work also quantifies the rack level power statistics and creates a new opportunity for further research in this area.

3 Rack level power statistics

We gathered statistics from a commercially operating HP BladeSystem p-Class enclosure [15]. The HP BladeSystem p-Class enclosure provides a number of monitoring tools for viewing the runtime power consumption information for the different rack components. We gathered the runtime power statistics of such racks using these tools [16]. Primarily we have used the web portal for monitoring the power statistics of these rack components. As the power consumption of these rack components remained almost constant at their full stated power, we did not need a more dynamic monitoring tool for further analysis.


The HP BladeSystem p-Class enclosure is one of the widely used enclosures in today's data centers. This information is typical of other rack installations and scales evenly across different racks. We utilize these statistics to estimate the energy savings achieved by the implementation of our technique.

The HP p-Class enclosure that we considered can support up to 16 blades and has 10 enclosure fans. The fans are part of the enclosure and of the cooling system; together they consume about 500 W of power on average. The interconnect bays connect the nodes within a rack and consume about 133 W of power. These power draws are independent of the node states: the same amount of power is consumed even when the nodes are turned off completely. Other enclosures might have additional components associated with them, and these components might also draw considerable power from the rack's PDU. For our study we restrict ourselves to just the fans and the interconnect bays. Nevertheless, if such components exist we would achieve even more savings than shown in this work.
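As a rough illustration of the scale of this overhead, a back-of-the-envelope calculation using only the figures quoted above (the 24-hour window is an arbitrary example, not a measurement from our deployment) shows that these rack components alone draw roughly 15 kWh per enclosure per day even when every blade is powered off:

# Back-of-the-envelope estimate of per-enclosure overhead power,
# using only the figures quoted above for the HP p-Class enclosure.
FAN_POWER_W = 500          # ten enclosure fans, combined average draw
INTERCONNECT_BAY_W = 133   # interconnect bays per enclosure

overhead_w = FAN_POWER_W + INTERCONNECT_BAY_W      # 633 W drawn even with all nodes off
overhead_kwh_per_day = overhead_w * 24 / 1000      # about 15.2 kWh per enclosure per day

print(f"Idle rack-component draw: {overhead_w} W "
      f"(~{overhead_kwh_per_day:.1f} kWh/day per enclosure)")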

In addition, there are typically blowers associated with each rack which help to bring down the rack temperature. We can turn off these blowers through the use of our technique and conserve even more energy. For simplicity, we do not include the savings from the blowers in our results, although we consider them to be significant as well. We also exclude the energy consumed by, and savings associated with, the nodes themselves from our discussion.

We argue that turning off the racks through the use of remotely controlled power distribution units can significantly enhance the power conservation. At the time of this work, PDUs such as APC's Switched Rack PDU [17] were available for use. APC's PDU can be managed remotely and allows cutting off the power supply through defined power outlets in the PDU. For more details regarding switched rack PDUs please refer to the product overview in [17].

4 Maui scheduler

The Maui scheduler is one of the widely used schedulers in HPC clusters [4, 19]. Like many other batch schedulers, Maui determines which job needs to be run when and where and informs the resource manager. Torque is one of the most commonly used resource managers [20]. Its function is to issue commands in the interest of Maui's scheduling decisions and also provide up to date information about the cluster. Torque also acts as an interface for the users to interact with their jobs. Thus Maui learns the job and user information from Torque.

Maui is well known for its highly configurable components. Changing job priorities based on adjustable parameter weights and node allocation policies are among the features relevant for our discussion. Apart from being run along with a resource manager such as Torque, Maui can also be run in simulation mode, where it can simulate years of workload in just a few hours. Thereby, Maui is a very powerful tool to study the scheduling of different workloads in HPC clusters.

At every periodic time interval Maui performs one cycle of the following steps, which is called a scheduling iteration [4]. (Only the steps relevant to our discussion are described here.)

(a) Change the priorities of the jobs in the input job queue (job priority policies such as weights for queue time, quality of service, usage, etc. are applied here).

(b) Choose a batch of jobs from the input job queue based on the new priorities.

(c) Allocate nodes for this batch of jobs, one job at a time, based on the node allocation policy and inform the resource manager.

(d) Backfill—a scheduling optimization.

We are interested in step (c) of the scheduling iteration described above. We implement our algorithm as a node allocation policy plug-in. We also discuss a possible enhancement to step (a) which affects the job priorities. Before we discuss the concept and the implementation details we discuss the existing default node allocation policies that are widely used [27]. We also compare our node allocation policy with the default node allocation policies, as we understand that most of the cluster schedulers are used in their default configurations [27]. Below we discuss some of the existing node allocation policies used in Maui [18] for completeness.

4.1 Node allocation policies

Node allocation is primarily the task of selecting a set ofnodes for a job that is about to be scheduled.

• First Available: In this node allocation algorithm, nodes are selected based on the first available node for the job. The order in which the nodes are selected is based on the order in which the nodes are provided by the resource manager. This node allocation scheme is extremely fast.

• Max Balance: This node allocation algorithm selects the most balanced set of nodes for the job in terms of a predefined metric, in most cases CPU speed. This is mostly relevant for heterogeneous clusters where nodes vary in CPU speed.

• Fastest: In this node allocation algorithm, nodes are selected by processor speed. The nodes with the fastest speed are selected first.

• CPU Load: The nodes are selected with the goal of even distribution of CPU load among the various nodes assigned for the task.

• Min Resource: Node selection is done on the basis of the node properties that satisfy the minimum requirements for the job. This is one of the widely used node allocation policies [27]. It is more like a best fit in terms of satisfying the resource requirements of the job.

• Local: This node allocation policy invokes any local node allocation policy that is custom to the cluster. We utilize this node allocation policy for conducting our experiments; the scheduler enters our code through this policy. We talk about the precise implementation of this extension in the implementation section.

4.2 Simulation mode

Another primary reason for choosing the Maui scheduler is that it has an in-built simulator which will help us simulate years of cluster workload logs in just a few hours. We briefly describe the simulation mode of Maui here.

The simulation mode of Maui takes two input files.

• Resource trace file: This file contains the cluster resource information on which the scheduling is performed. The file has every node's information, with node speeds, amount of memory and many other configurable parameters. We have created one such file for performing our simulations. The interesting feature here is that we can modify the resource trace file to suit the average cluster utilization that we need for running our simulations for constant workloads. This feature allows us to simulate a wide variety of workloads with different cluster utilizations. We later perform a majority of our simulations at a cluster utilization of 45 % to 50 %, as justified previously in the discussion of low utilization clusters.

• Workload trace file: This file contains the workload traces that we wish to simulate. The workload trace file follows a format defined in [18]. A number of workload trace files are available in the parallel workload archive [21]. These traces are from real world clusters spanning years of cluster usage. Some amount of cleanup is required (usually done with awk scripts) to utilize these workload logs.

The Maui scheduler in the simulation mode generates notable statistics at the end of the simulation. This statistical information helps us understand how the scheduler performed for the defined configuration and trace files. There is also a log level configuration setting that helps emit useful custom statistical information during the simulation. These logs help us derive certain conclusions about our proposed technique.

The Maui scheduler also provides a number of commands that can be used during the simulation run to understand how the scheduling is being performed. These commands help us control the simulation flow as well. They also aid in viewing the node allocation at any scheduling iteration during the simulation run. We detail one such command's output in the results section.

5 Concept

Our goal is to bring rack awareness into the scheduler. Mainly we would want to allocate nodes more intelligently to a particular job, which will reduce the total number of racks being utilized during the job's execution.

Consider, for example, a cluster with 3 racks of 3 nodes each. Each rack has a set of free nodes and occupied nodes; the occupied nodes are being utilized by currently running jobs.

Suppose at this time a new job Jt has to be allocated nodes on the cluster, and assume Jt requires 3 nodes to start running. If we allocate one node from each rack, we will be using all three racks, which increases the power consumption by an additional amount due to the rack components that we described in Sect. 3. Had we allocated all 3 nodes from the entirely free rack with Rack Id 1, we would still end up utilizing all three racks, since Racks 2 and 3 already host running jobs. This is how a possible rack unaware job scheduler would perform the node allocation for the job Jt.

However, a rack aware scheduler like ours would choose 1 node from the rack with Rack Id 2 and 2 nodes from the rack with Rack Id 3, resulting in keeping the rack with Rack Id 1 shut off. Thereby, we can have more energy savings. Our algorithm takes advantage of this form of node allocation as shown in the following section.

Though the above concept makes perfect sense for performing node allocation, there are certain limitations which need to be addressed. The concept works nicely without any modifications for jobs with a single node requirement, or for parallel jobs with little or no communication between their composite tasks. However, for parallel jobs that tend to be communication intensive, our technique might end up allocating nodes on different racks. [25] observes that there is about a 15 % decrease in job performance due to job splitting across multiple racks. This is not significant considering the fact that not all jobs are communication intensive. Nevertheless, we propose a modified scheduling technique which tries to reduce this effect of job splitting. One simple solution is to pack the job into a single rack, i.e. allocate nodes from the same rack to a particular job. This reduces the effect of job splitting considerably. However, it is not always possible to allocate nodes from a single rack, and we might as well end up starting a switched off rack for the new job. Hence we improvise on this and propose a technique which merges both approaches and tries to achieve a balance between these two extremes. We discuss these techniques in the following sections.

6 Algorithm and implementation details

Below we discuss our first algorithm, based on a maximum remaining time strategy. We discuss its drawbacks and propose another algorithm which selects nodes in a best fit scheme, thereby packing jobs into a single rack whenever possible. We discuss the drawbacks of the best fit algorithm and propose a merge of the above two algorithms.

6.1 Maximum remaining time algorithm

We formulate and implement the maximum remaining time algorithm as a node allocation policy in the Maui scheduler. The reason we choose the rack with the maximum remaining time is that we expect that particular rack to remain occupied for a longer period of time when compared to any of the other racks. Thus, by choosing the rack whose job has the longest remaining time, we keep a certain number of racks always utilized while the remaining racks can stay powered off. A side effect is that the same racks tend to be used repeatedly; to avoid this situation we make certain additions to the proposed algorithm, described after the algorithm listing.

ALGORITHM: Allocate nodes to a job Jt based on maximum remaining time.

INPUT: Node requirements for job Jt, rack occupancy.

OUTPUT: Node allocation for job Jt.

1. Categorize the racks as utilized (U) and not-utilized (NU).

2. For each rack i in U {
       max_remaining_time = 0;
       For each node N in Ui {
           Nt = getRemainingTime(Job(N));
           If (isRunning(N) and Nt > max_remaining_time) {
               max_remaining_time = Nt;
           }
       }
       Ui = max_remaining_time;
   }

3. Sort U based on the computed max_remaining_times.

4. Allocate nodes from the racks using sorted U (the rack with the largest max remaining time first, then racks with smaller max remaining times).

5. If the request is still not satisfied, allocate from the not-utilized racks NU.

The additions to the algorithm keep track of the rack utilization over time and categorize the racks as utilized, over-utilized and relaxed. A utilized rack is one which is currently being utilized and is well within the threshold of being utilized heavily. An over-utilized rack is one which is used substantially and on which no more scheduling should be done unless inevitable. A relaxed rack is one which was in the over-utilized state and needs to be in an idle state for some time. These additions are incorporated to reduce the repeated usage of the same hardware. They do not cause noteworthy deviation in our results; therefore, for simplicity of the discussion, we exclude this aspect.
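The following is a minimal, self-contained Python sketch of this allocation policy, given only for illustration; the actual implementation is a C plug-in inside Maui, and the utilized/over-utilized/relaxed bookkeeping described above is omitted here.

from typing import Dict, List

# rack_nodes: rack id -> list of (node id, remaining_time); remaining_time is
# None for a free node and the running job's remaining seconds otherwise.
def allocate_max_remaining_time(nodes_needed: int,
                                rack_nodes: Dict[int, list]) -> List[str]:
    utilized, not_utilized = {}, []
    for rack, nodes in rack_nodes.items():
        remaining = [t for _, t in nodes if t is not None]
        if remaining:
            utilized[rack] = max(remaining)   # rack's maximum remaining time
        else:
            not_utilized.append(rack)

    # Utilized racks first, longest remaining time first; empty racks last.
    order = sorted(utilized, key=utilized.get, reverse=True) + not_utilized

    allocation = []
    for rack in order:
        for node, t in rack_nodes[rack]:
            if t is None and len(allocation) < nodes_needed:
                allocation.append(node)
    return allocation if len(allocation) == nodes_needed else []

# Example: rack 1 is entirely free (off), racks 2 and 3 host running jobs.
cluster = {1: [("n1", None), ("n2", None), ("n3", None)],
           2: [("n4", 3600), ("n5", None), ("n6", None)],
           3: [("n7", 7200), ("n8", None), ("n9", None)]}
print(allocate_max_remaining_time(3, cluster))  # picks free nodes in racks 3 and 2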

This algorithm affects step (c) of the Maui scheduling iteration. Maui provides a “LOCAL” node allocation policy, which calls a well-defined function MlocalJobAllocateResources() whose input parameters are a list of eligible nodes and the job details (including the job's node requirements). This function is called when we need to allocate nodes for a job that has been selected to run based on its priority. We extend this particular function to implement the above described algorithm.

We also model the racks into the frame data structure of Maui. The frame data structure till date mainly served the purpose of node organization, for displaying the node information properly. By using this data structure of Maui most of the scheduler code remains untouched. We have also added code to gather and generate runtime statistics related to the energy consumption of the cluster. We take care that the statistics generation code is not included in the scheduling time calculation, which is discussed in detail below.

The drawback of the above scheme, as mentioned in the previous section, is the job splitting that occurs due to the way in which we make the node selection. This can result in a loss in job performance; the percentage reduction in performance for such jobs is around 15 % as indicated in this study [26]. Hence we propose our next algorithm, a best fit rack aware algorithm which essentially tries to pack the job into a single rack.

6.2 Best fit algorithm

In this algorithm we primarily note down the number of unallocated nodes in each rack. We then sort the racks based on the difference between the number of remaining nodes and the number of required nodes, essentially assigning ranks to the racks. The rack with the highest rank is the one with the least difference, meaning the job fits that rack most precisely.

A sample allocation taken during the simulation run shows that the nodes are more or less allocated contiguously in one frame (rack). Some jobs are allocated across different frames; this is because, at the time of the node allocation for that particular job, the scheduler could not find a best fit for it. This leads to fragmentation of the job across multiple racks, as observed for job “S” in the sample allocation. However, a good number of jobs are still allocated within a single rack.

The drawback of this approach is that many times we do not find a suitable rack, or in other words there are not enough nodes in any of the racks. In such cases the best fit will select an empty rack, driving up the rack utilization. This essentially increases the energy consumption, as now we have more racks that are kept on than needed. However, it does solve the problem of job splitting considerably, as shown in the results section.

The best fit scheme tries to overcome the limitations of the maximum remaining time scheme. However, it primarily affects only jobs that request multiple nodes and are communication intensive.

ALGORITHM: Allocate nodes to a job Jt based on best fit strategy.

INPUT: Node requirements for job Jt, rack occupancy.

OUTPUT: Node allocation for job Jt.

1. Determine the number Nv of vacant nodes in each rack. We call this set V.

2. For each rack i {
       Si = Abs(Vi − Node_Requirement(Jt));
   }

3. Sort S. (The closer the number of vacant nodes in a rack is to the number of required nodes, the smaller the difference.)

4. Allocate the nodes from the rack with the minimum difference first, and then from racks with higher differences, using S.
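A corresponding Python sketch of the best fit ranking is given below, again for illustration only; node bookkeeping and the Maui plug-in mechanics are omitted.

from typing import Dict, List

# vacant: rack id -> list of free node ids
def allocate_best_fit(nodes_needed: int, vacant: Dict[int, List[str]]) -> List[str]:
    # Rank racks by |free nodes - requested nodes|: the closest fit comes first.
    order = sorted(vacant, key=lambda r: abs(len(vacant[r]) - nodes_needed))
    allocation = []
    for rack in order:
        for node in vacant[rack]:
            if len(allocation) < nodes_needed:
                allocation.append(node)
    return allocation if len(allocation) == nodes_needed else []

# Example: a 3-node request fits rack 3 exactly, so no job splitting occurs.
print(allocate_best_fit(3, {1: ["n1"],
                            2: ["n2", "n3", "n4", "n5"],
                            3: ["n6", "n7", "n8"]}))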


The number of communication intensive jobs occurring in a typical workload depends entirely on the kind of workload the data center processes. Hence, if the number of communication intensive jobs is considerably lower we can use the maximum remaining time algorithm. On the other hand, if it tends to be very high we can use the best fit scheme. However, any real workload is expected to contain some amount of communication intensive jobs. We can then use a solution that merges the above two algorithms into one. We discuss one such possible algorithm in the next section.

6.3 Merged rankings algorithm

In the first two algorithms we are essentially assigning ranks to each rack based on certain properties of the jobs that it is currently running (or its current state). In the merged rankings we still use the rankings from the previous two algorithms. We merge the two rankings in this manner to compute the new ranking for each rack.

R_merged = α · R_max_rem_time + β · R_best_fit

Here α is the parameter that determines the prominence of the maximum remaining time algorithm, and β is the parameter that determines the prominence of the best fit algorithm. We set the values of α and β to be between 1 and 100 and require that they sum to one hundred, so that the relative influence of each algorithm is apparent. The larger the value of a parameter, the more influence the corresponding ranking has on the final ranking. We then sort the racks again based on the freshly computed rankings and use this sorted list to perform the node allocations. The results show the variation in the number of jobs split through the use of this technique. The primary motivation for leaving α and β as tunable parameters is that different data centers serve different kinds of workloads; we can adjust these parameters based on the kind of workload the scheduler is expected to schedule in the data center.
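The merged ranking itself can be sketched as follows. This is a simplified illustration which assumes each rack already has a rank under the two base algorithms (rank 1 being the most preferred rack); the example values are arbitrary.

def merged_ranking(rank_max_rem_time, rank_best_fit, alpha=90, beta=10):
    """Combine per-rack ranks from the two base algorithms.

    rank_max_rem_time / rank_best_fit: rack id -> rank (1 = most preferred).
    alpha + beta is kept at 100; a larger weight gives that algorithm
    more influence on the final ordering.
    """
    assert alpha + beta == 100
    merged = {rack: alpha * rank_max_rem_time[rack] + beta * rank_best_fit[rack]
              for rack in rank_max_rem_time}
    # Racks are then allocated in order of increasing merged score.
    return sorted(merged, key=merged.get)

print(merged_ranking({1: 3, 2: 1, 3: 2}, {1: 1, 2: 2, 3: 3}))  # -> [2, 3, 1]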

An adaptive approach to this scheme can also be deployed, where the parameters are adjusted based on the workload observed during the scheduling of tasks. This would involve a feedback loop in which the parameters are continuously tweaked to maximize the energy savings. This technique is suitable for varying workloads. A more sophisticated technique could utilize a number of non-linear basis functions to adjust the parameters in the equation. We only provide manual adjustment of these parameters in our results, as a proof of concept.

7 Effects of parallel job splitting

One of the effects of parallel job splitting is an increase in the wall time of the parallel job that has been split across multiple racks. This is mainly due to the delay in communication through the switch that interconnects the racks, since intra-rack interconnects are considerably faster. This effect is mainly observed for parallel jobs that are communication intensive. Non-parallel jobs, or parallel jobs that have very little communication among their processes, will not contribute to the increase in the energy consumption. Hence we perform a systematic study of how such communication intensive jobs could affect the energy consumption.

We consider two factors related to this aspect of scheduling. Firstly, we briefly discuss the effect of the job splitting on the power consumption in network switches. Secondly, we consider the wall time stretch and discuss a possible implementation to increase the wall time of a job that has been split by our scheduling.

7.1 Power consumption in network switches

As discussed in the previous section, our technique results in some amount of job splitting. We consistently see in our results that the number of job splits is not extraordinarily deviant from the current node allocation schemes. However, we still discuss this issue and provide some more data from [27] to support our technique. The authors study various switches and the effect of network traffic on the power consumption in the switches. Due to rack aware scheduling we expect some amount of job splitting, which in turn results in increased network traffic across the spline switches (switches that connect two different racks are called spline switches). In this section we try to understand the increase in the energy consumption due to the increase in the network traffic.

From [27] we gather that the network switches account for about 15 % of the total energy consumption in a data center. This is a significant amount of energy. However, we need to study the effect of the variation in the load on these switches. The authors perform a number of experiments and run their custom benchmarking tools. They claim that the power consumption is not affected much by an increase in port utilization; the only factor that drives up the power consumption is the number of active ports being utilized. The more active ports in a switch, the higher the power consumption.
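As a simple illustration of this observation, a switch's power can be modeled as a function of its active port count rather than its traffic. The wattage figures below are placeholder values chosen for the example only; they are not measurements from [27].

# Illustrative model: switch power grows with the number of active ports,
# not with how much traffic those ports carry. Wattages are placeholders.
def switch_power_w(active_ports: int, base_w: float = 60.0,
                   per_port_w: float = 1.5) -> float:
    return base_w + active_ports * per_port_w

print(switch_power_w(48))  # all inter-rack ports active
print(switch_power_w(36))  # ports belonging to a powered-off rack disabled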

Now, with respect to our proposed technique, we can see that the number of active ports interconnecting the racks remains constant, and these ports must always be kept in the “ON” state regardless of the traffic flowing through them. If a rack is switched off entirely we can consider switching off the active ports associated with it. In fact, this gives us an opportunity to direct the switching “OFF” of the active ports for those racks which we want to switch off, bringing down the power consumption of the switch. The job splitting might increase the amount of network traffic flowing through the switch; however, this would not increase the power consumption and therefore does not cut into the savings gained.

7.2 Accommodation for wall time stretch

Though the job splitting does not lead to a significant increase in the energy consumption, as discussed in the previous section, it does lead to an increase in wall time for certain parallel jobs that are communication intensive. Hence we need to accommodate this increase in the wall time requirement for a job. We modify the Maui scheduler to increase the wall times for those jobs that are split across the racks.

The obvious question of by how much we should increase the wall times for these jobs needs to be addressed. Another aspect to consider is whether we should increase the wall time for all jobs or only for those which are communication intensive. As indicated previously in our discussion, [25] observes about a 15 % decrease in job performance for jobs that are split across multiple racks. Hence we increase the wall time of the jobs that are split across racks by about 15 %. In the Maui scheduler we modify two important variables, viz. WCLimit and SimWCTime; the second variable needs to be adjusted only when running in the simulation mode. There are a couple of other variables (properties/attributes of the submitted job) that we need to adjust to make the wall time increase take effect for new jobs.
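A minimal sketch of this adjustment is shown below. It is illustrative Python, not the actual modification; in the implementation the corresponding fields of Maui's job record (WCLimit, and SimWCTime in simulation mode) are updated. The random treatment of split jobs mirrors the control knob described in the next paragraph.

import random

SPLIT_PENALTY = 1.15  # ~15 % wall time increase for split, communication-heavy jobs

def stretch_walltime(wallclock_limit_s: int, is_split: bool,
                     comm_intensive_fraction: float) -> int:
    """Stretch the wall time of a job that was split across racks.

    Because the trace does not tell us which jobs are communication
    intensive, a split job is treated as such with the given probability.
    """
    if is_split and random.random() < comm_intensive_fraction:
        return int(wallclock_limit_s * SPLIT_PENALTY)
    return wallclock_limit_s

print(stretch_walltime(3600, is_split=True, comm_intensive_fraction=0.6))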

Another issue to consider is for which jobs we should increase the wall time. Not all jobs that are split across racks will need their wall times increased. In real world clusters, the percentage of communication intensive jobs varies widely; it depends on the nature of the workload the cluster is serving. Hence, for our study we create a control that allows us to adjust the percentage of communication intensive jobs in the workload trace and thereby study the effect this percentage has on the energy savings achieved through our techniques. In the implementation we adjust the wall times of the split jobs randomly, taking care to maintain the desired percentage of communication intensive jobs. In real world clusters we could possibly use profile information of the running jobs to decide whether they are communication intensive or not, and adjust the wall times only for such jobs. Profile information is typical in HPC environments, as the same programs are typically run with varying inputs. Nevertheless, determining precisely whether a particular job is communication intensive is a hard problem and is difficult to do while scheduling the job.

8 Experimental setup

As mentioned before, the simulation mode of Maui is a useful feature which helps to simulate different workloads for various cluster and scheduler configurations. We use the Maui scheduler in this mode to test our implementation.

In the simulation mode Maui accepts a resource trace file and a workload trace file, which depict the cluster configuration and workload logs respectively. We utilize a reduced version of the workload log of HPC2N [21], which covers about 1.5 years of cluster usage. HPC2N is a 120 node cluster with 2 processors per node. We generate the resource trace file for different cluster utilizations for our experiment. The HPC2N log from the parallel workload archive has a number of anomalies that we had to clean up using awk scripts to make it usable as a workload trace file.

As we have discussed before, we see a number of clusters in the real world that have about 40 to 50 % cluster utilization. Hence we run a number of experiments at about 45 % cluster utilization. Wherever not mentioned otherwise, assume this average cluster utilization.

For the merged rankings algorithm we need to choose suitable values of the α and β parameters. We tried different values for these parameters and determined that with α = 90 and β = 10 we get the maximum returns in savings. Hence, in the following results the values of α and β are 90 and 10 respectively wherever the merged rankings algorithm is utilized.

The results of the effect of the wall time stretch are indicated at the end of the next section. The effect of the wall time stretch is compared only amongst our proposed techniques, as we consistently show that the other techniques are no better in terms of the number of job splits. The number of job splits is not a true indicator when it comes to relating it to energy savings; hence, for this comparison we include the wall time stretch and measure the energy savings for our proposed techniques.

9 Results

Figure 1 shows the utilizations of different real world clusters. The log information is sourced from various academic and commercial HPC installations [20–22] (Table 1). We use the Maui scheduler in its simulation mode to simulate these different workload logs. The Maui scheduler in this mode provides various useful statistics regarding the corresponding workload log. We see that most of these clusters are around 40 % to 50 % utilization and rarely do we find high utilization clusters. Hence we can safely assume that there are many clusters in the real world which have 40 % to 50 % utilization; on average they have about 45 % utilization. We shall see that our algorithm is very well suited for clusters which fall in this range of utilization. We have also used this assumption for a few of our experiments; in particular, we compare our algorithm with existing node allocation policies at 45 % cluster utilization.


Fig. 1 Cluster utilization data of academic and commercial clusters

Fig. 2 Energy savings for different percentages of cluster utilization

Table 1 Academic/Commercial clusters [21]

1 NASA iPSC, 2 LANL CM5, 3 SDSC Par95, 4 SDSC Par96, 5 Early CTC SP2, 6 CTC SP2, 7 LLNL T3D,
8 KTH SP2, 9 SDSC SP2, 10 LANL O2K, 11 OSC Cluster, 12 SDSC Blue, 13 HPC2N, 14 DAS2 fs0,
15 DAS2 fs1, 16 DAS2 fs2, 17 DAS2 fs3, 18 DAS2 fs4, 19 SDSC DataStar, 20 LPC EGEE, 21 LCG,
22 SHARCNET, 23 LLNL Atlas, 24 NCAR-Babyblue, 25 NCAR-BlackForest, 26 NCAR-BlackForst(2), 27 NCAR-Bluedawn, 28 NCAR-BlueskyB32,
29 NCAR-BlueskyB8, 30 NCAR-Dave, 31 NCAR-Dataproc, 32 NCAR-Mouache, 33 NCAR-Chinook, 34 NCAR-Chnookfe

Figure 2 shows how much energy is saved for different percentages of cluster utilization by the use of the maximum remaining time technique. We use the same HPC2N workload trace file, but change the cluster configuration (number of nodes) via the resource trace file to obtain the savings for different utilizations. We see that our algorithm gives considerable savings at lower and middle levels of cluster utilization, and the savings decay as the utilization rises. Most of the real world clusters fall in the lower and middle levels of cluster utilization, as discussed above.

We define rack utilization as a metric that measures the number of racks utilized over the complete period of the workload. The higher the rack utilization, the more energy is consumed. Figure 3 shows the rack utilization comparison of the maximum remaining time algorithm with the other node allocation policies currently existing in Maui. The First Available node allocation policy simply selects the nodes it finds first that match the requirements. Max Balance allocates nodes with a balanced set of node speeds; in a homogeneous cluster it simply falls through to First Available. Fastest selects the fastest nodes first. CPU Load selects nodes based on the current CPU load. In Min Resource, those nodes with the fewest configured resources which still meet the job's resource constraints are selected. We see that the maximum remaining time algorithm consistently performs better than any other node allocation policy.

Figure 4 shows the energy savings in terms of Megawatt Hours (MWh) in comparison with the other node allocation policies described above.


Fig. 3 Rack utilizations for different node allocation policies: Rack Aware (1), First Available (2), Max Balance (3), Fastest (4), CPU Load (5), Min Resource (6)

Fig. 4 Energy savings for different node allocation policies: Rack Aware (maximum remaining time) (1), First Available (2), Max Balance (3), Fastest (4), CPU Load (5), Min Resource (6)

We see that our algorithm gives about 7 % more energy savings than the best existing node allocation policy (in terms of energy savings), and about 14 % more savings when compared to Min Resource, which is the default node allocation policy for Maui.

Figure 5 shows the comparison of the average scheduling time for the different node allocation policies. The maximum remaining time algorithm does not lead to any significant increase in the scheduling time; it remains on par with the other policies.

Figure 6 compares the change in the number of racks for the different node allocation policies. We see that our technique does not lead to frequent changes in the number of racks utilized in consecutive scheduling iterations as compared to other policies. This means that we will have sufficient time to perform the rack power on/off and would keep the cluster utilization stable enough as compared to the other policies. This does not mean that we have to take the rack power on/off decisions at every scheduling iteration; we suggest doing it at some small multiple of scheduling iterations.

Figure 7 shows the savings for different numbers of nodes per rack in the cluster configuration. We see that the savings due to interconnect bays remain more or less consistent, whereas the savings due to the fans increase as the number of nodes per rack decreases. This is expected, as the number of fans is in direct proportion to the number of nodes in a rack.

The above results are consistent, in similar proportions, for the other two algorithms as well. We now discuss the effects of the enhancements made to the maximum remaining time algorithm.

The following Table 2 provides the information about the number of jobs that were actually split across different racks when performing the node allocation. Any job that has one or more nodes spread across multiple racks is counted.

The total number of jobs scheduled is 263683. We can see that a considerable amount of job splitting happens due to the maximum remaining time technique. The best fit algorithm has the minimum number of job splits, because it inherently tries to avoid splitting a job across multiple racks.


Fig. 5 Average scheduling times for different node allocation policies: Rack Aware (1), First Available (2), Max Balance (3), Fastest (4), CPU Load (5), Min Resource (6)

Fig. 6 Number of changes in rack count effected by the different node allocation policies at every iteration

Fig. 7 Savings for different nodes-per-rack cluster configurations: 8 racks, 30 nodes per rack (1); 12 racks, 20 nodes per rack (2); 16 racks, 15 nodes per rack (3); 20 racks, 12 nodes per rack (4)

The merged rankings algorithm shows a number of splits between those of the best fit and maximum remaining time algorithms. Different values of α and β vary the number of job splits observed; high values of β, which give greater weight to the best fit strategy, drive down the number of splits incurred.


Fig. 8 Energy savings for different percentages of communication intensive jobs

Table 2 Number of jobs split for different node allocation policies

S.No  Algorithm (node allocation policy)  Number of jobs that were split across racks

1 First Available 108359

2 Maximum Remaining Time 119349

3 Best Fit (rack based) 98437

4 Merged Rankings (α = 90, β = 10) 102677

We also compare the number of jobs that were split with the First Available node allocation policy. We notice that the number of jobs split by the First Available node allocation policy is comparable to our schemes, and we do not deviate much in terms of the number of jobs being split across multiple racks. We believe we do not deviate much from existing node allocation techniques in this aspect and hence would achieve some savings regardless of the issue of the job splitting.

We further study the issue of job splitting by including the effect of the wall time stretch from here on. Figure 8 shows the comparison of the three proposed algorithms for different percentages of communication intensive jobs. For 0 % communication intensive jobs the energy savings are the same as those we would observe without the wall time stretch accommodation; at 100 % the wall time is adjusted for every job that is split across multiple racks. We see that the Best Fit (rack wise) algorithm consistently performs poorer than the other two algorithms. This is because, though it does not split jobs as extensively as the other two, the resulting reduction in wall time stretches is not enough to compensate for the additional racks it keeps powered on. The merged rankings algorithm follows the maximum remaining time algorithm closely in terms of savings; however, we see that its savings are slightly higher at about 60 % communication intensive jobs.

Lastly, Fig. 9 shows the increase in the wall times for the jobs as the percentage of communication intensive jobs increases. We measure the total processing hours for all the jobs in the cluster. We see that the best fit algorithm has the minimum number of processing hours compared to the others. Nevertheless, the minimum number of processing hours does not indicate maximum energy savings, as shown in the previous graph. This could be due to various other factors, such as the increase in the number of racks being utilized, which drives up the energy consumption. It could also be due to the deviation from the maximum remaining time constraint, which essentially ensures that a highly utilized rack will be utilized again. Nevertheless, there is not much deviation amongst the three proposed algorithms in terms of the increase in the number of processing hours due to wall time stretches.

The graph also highlights that the increase in the wall time stretch results in a reduction in energy savings, as the cluster runs for a longer period of time. This is consistent with the previous graph of energy savings for varied percentages of communication intensive jobs.

10 Conclusion

This study demonstrates that, at rack level granularity, we can further enhance existing energy aware scheduling techniques and achieve significant energy conservation. We have consistently demonstrated savings of 7 % to 14 % over existing node allocation schemes. We have also seen that these techniques can be applied in tandem with many other existing energy conservation schemes.

The study identifies the power hungry rack components and provides a measure of the power consumption statistics for these different rack components.



Fig. 9 Processing hours for different percentages of communication intensive jobs

We proposed and implemented three algorithms for realizing rack awareness in one of the batch schedulers, viz. Maui. The first is the maximum remaining time algorithm, which is affected by the job splitting problem discussed above. The best fit algorithm is proposed to overcome the limitations of the first algorithm; however, it achieves lower savings and ends up driving up the rack utilization. We then propose a merged rankings algorithm, which we believe provides the maximum energy savings. We have seen that the parameters provided can be adjusted to further tune the merged node allocation algorithm and achieve enhanced energy savings.

We also focused on the issue of job splitting across multiple racks and claim that our technique performs similarly to the existing node allocation policies. We also discussed the issue of network switch power consumption and the effect of job splitting on the energy consumed by these switches. We note that our technique does not in any way increase the power consumption of these switches. On the contrary, we could turn off the active ports associated with a rack and conserve energy; this would be a simple extension to our existing technique.

We also implemented a technique of increasing the wall times of communication intensive jobs to account for the loss in performance due to job splitting. We made a comparative study among various existing and proposed techniques with regard to the total energy conserved by deploying our technique. We also provided a way to adjust the parameters based on the nature of the workload, which would be quite useful in real world deployments.

As further improvements, we can modify job priorities based on the existing node allocation patterns in the racks. We also need to quantify the energy savings due to the blowers associated with the racks; we believe these savings would also be significant.

References

1. Report to Congress on Server and Data Center Energy Efficiency, Public Law 109-431. U.S. Environmental Protection Agency ENERGY STAR Program, August 2007

2. Koomey, J., Belady, C., Patterson, M., Santos, A., Lange, K.-D.: Assessing trends over time in performance, costs and energy use for servers. LLNL, Intel Corporation, Microsoft Corporation and Hewlett-Packard Corporation. Released on the web on August 17, 2009

3. Liu, Y., Zhu, H.: A survey of the research on power management techniques for high-performance systems. Softw. Pract. Exp. 40(11) (2010). doi:10.1002/spe.v40:11

4. Jackson, D., Snell, Q., Clement, M.: Core algorithms of the Maui scheduler. SpringerLink, January 01, 2001

5. LLNL, H.P., Bull: The simple Linux utility for resource management (SLURM). Available at http://www.llnl.gov/linux/slurm/. Revision 2.0.3, June 30, 2009

6. Pinheiro, E., Bianchini, R., Carrera, E., Heath, T.: Load balancing and unbalancing for power and performance in cluster-based systems. Technical report DCS-TR-440, Department of Computer Science, Rutgers University, May 2001

7. Chase, J., Anderson, D., Thakar, P., Vahdat, A., Doyle, R.: Managing energy and server resources in hosting centers. In: Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP'01), Canada, October 2001

8. Chen, Y., Das, A., Qin, W., Sivasubramaniam, A., Wang, Q., Gautam, N.: Managing server energy and operational costs in hosting centers. In: Proceedings of the 2005 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'05), Canada, June 2005

9. Verma, A., Ahuja, P., Neogi, A.: Power-aware dynamic placement of HPC applications. In: Proceedings of the 22nd International Conference on Supercomputing (ICS'08), Greece, June 2008

10. Dhiman, G., Marchetti, G., Rosing, T.: vGreen: a system for energy efficient computing in virtualized environments. In: ISLPED, California, USA, August 2009

11. Nathuji, R., Schwan, K.: VPM tokens: virtual machine-aware power budgeting in datacenters. In: High Performance Distributed Computing, June 2008

12. Gabrielyan, E., Hersch, R.D.: Network topology aware scheduling of collective communications. In: 10th International Conference of Telecommunications, March 2003

13. Heath, T., Centeno, A., George, P., Ramos, L., Jaluria, Y., Bianchini, R.: Mercury and Freon temperature emulation and management for server systems. In: ASPLOS, October 2006



14. Moore, J., Chase, J., Ranganathan, P., Sharma, R.: Temperature-aware workload placement in data centers. In: USENIX (2005)

15. HP BladeSystem p-Class Infrastructure Specification: http://h18004.www1.hp.com/products/quickspecs/12330_div/12330_div.html

16. HP Systems Insight Manager: version 6.2

17. Product Description of APC Switched Rack Power Distribution Unit: http://www.apc.com/products/family/index.cfm?id=70

18. Maui Scheduler Administrative Guide: Version 3.2. http://www.clusterresources.com/products/maui/docs/mauiadmin.shtml

19. Torque Admin Manual: Version 3.0. http://www.clusterresources.com/products/torque/docs/

20. HPC2N Log from Parallel Workloads Archive: HPC2N is a Linux cluster located in Sweden. http://www.cs.huji.ac.il/labs/parallel/workload/l_hpc2n/index.html

21. Parallel Workload Archive: http://www.cs.huji.ac.il/labs/parallel/workload/logs.html

22. SCD FY 2003: ASR. http://www.cisl.ucar.edu/docs/asr2003/mss.html

23. Hermenier, F., Lorca, X., Menaud, J., Muller, G., Lawall, J.: Entropy: a consolidation manager for clusters. In: VEE, Washington (2009)

24. Beral, G., Nou, J., Guitart, G.T.: Towards energy-aware scheduling in data centers using machine learning. In: e-Energy, Germany (2010)

25. Kandala, K., Subramoni, H., Panda, D., Vishnu, A.: Designing topology-aware collective communication algorithms for large scale InfiniBand clusters: case studies with scatter and gather. In: IPDPS, Atlanta (2010)

26. Etsion, Y., Tsafrir, D.: A short survey of commercial cluster batch schedulers. Technical Report 2005-13, Hebrew University, May 2005

27. Sharma, M., Ranganathan, B.: A power benchmarking framework for network devices. In: Lecture Notes in Computer Science, vol. 5550. Springer, Berlin (2009)

Vikas Ashok Patil received the Master's degree in Computer Science and Engineering in 2011 from the State University of New York, Buffalo. During this time his focus was on power aware scheduling in HPC data centers, parallel programming and MapReduce programming. In the summer of 2010 he was involved in research related to cluster scheduling and Hadoop benchmarking at Computational Research Laboratories. Between 2007 and 2009 he worked for IBM India Software Labs, where he was part of a research group that worked on leading edge enhancements for the WebSphere portfolio of products. He pursued his undergraduate degree in Computer Science at Sri Jayachamarajendra College of Engineering, Mysore. He currently works for FactSet Research Systems, a financial data provider.

Vipin Chaudhary is a Professor of Computer Science and Engineering, the Center for Computational Research, and the New York State Center of Excellence in Bioinformatics and Life Sciences at the University at Buffalo, SUNY, and the CEO of Computational Research Laboratories. Earlier he was the Senior Director of Advanced Development at Cradle Technologies, Inc., where he was responsible for advanced programming tools and architecture for multi-processor chips. From January 2000 to February 2001 he was with Corio, Inc., where he held various technical and senior management positions, finally as the Chief Architect. In addition, he is on the advisory boards of several startup companies.
His current research interests are in the area of High Performance and Big Data Computing and its applications to scientific, engineering, financial, social, and medical problems, as well as Computer Assisted Diagnosis and Interventions. He has been the principal or co-principal investigator on over $25 million in research projects from government agencies and industry and has published over 170 peer-reviewed papers.
Vipin received the B.Tech. (Hons.) degree in Computer Science and Engineering from the Indian Institute of Technology, Kharagpur, in 1986, the M.S. degree in Computer Science, and the Ph.D. degree in Electrical and Computer Engineering from The University of Texas at Austin, in 1989 and 1992, respectively. He was awarded the prestigious President of India Gold Medal in 1986 for securing the first rank amongst graduating students in IIT.

