
Improving Preemptive Scheduling with Application-Transparent Checkpointing in Shared Clusters

Jack Li, Calton Pu
Georgia Institute of Technology

Yuan Chen, Vanish Talwar, Dejan Milojicic
HP Labs

ABSTRACT

Modern data center clusters are shifting from dedicated single-framework clusters to shared clusters. In such shared environments, cluster schedulers typically utilize preemption by simply killing jobs in order to achieve resource priority and fairness during peak utilization. This can cause significant resource waste and delay job response time.

In this paper, we propose using suspend-resume mechanisms to mitigate the overhead of preemption in cluster scheduling. Instead of killing preempted jobs or tasks, our approach uses a system-level, application-transparent checkpointing mechanism to save the progress of jobs for resumption at a later time when resources are available. To reduce the preemption overhead and improve job response times, our approach uses adaptive preemption to dynamically select appropriate preemption mechanisms (e.g., kill vs. suspend, local vs. remote restore) according to the progress of a task and its suspend-resume overhead. By leveraging fast storage technologies, such as non-volatile memory (NVM), our approach can further reduce the preemption penalty to provide better QoS and resource efficiency. We implement the proposed approach and conduct extensive experiments via Google cluster trace-driven simulations and applications on a Hadoop cluster. The results demonstrate that our approach can significantly reduce resource and power usage and improve application performance over existing approaches. In particular, our implementation on the next-generation Hadoop YARN platform achieves up to a 67% reduction in resource wastage, a 30% improvement in overall job response times and a 34% reduction in energy consumption over the current YARN scheduler.

General Terms
Management, Performance

Keywords
Cloud computing, Cluster resource management, Scheduling

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Middleware '15, December 07-11, 2015, Vancouver, BC, Canada
(c) 2015 ACM. ISBN 978-1-4503-3618-5/15/12 ...$15.00.

DOI: http://dx.doi.org/10.1145/2814576.2814807.

1. INTRODUCTION

Modern data centers are shifting to shared clusters where the resources are shared among multiple users and frameworks [24, 14, 21, 3]. A key enabler for such shared clusters is a cluster resource management system which allocates resources among different frameworks. For example, Hadoop's new-generation platform, YARN (Yet Another Resource Negotiator) [24], allows multiple data processing engines such as interactive SQL, real-time streaming, and batch processing to share resources and handle data stored in a single platform in a fine-grained manner. Other similar platforms include Apache Mesos used at Twitter [14] and proprietary solutions deployed at Google and Microsoft [3].

Current cluster schedulers typically utilize preemption to coordinate resource sharing, achieve fairness and satisfy SLOs during resource contention. For example, if high priority jobs share the same cluster with low priority jobs and a resource shortage occurs, these schedulers preempt the low priority jobs and give more resources to high priority jobs. The current mechanism to handle such preemption is to simply kill the low priority jobs and restart them later when resources are available. This simple preemption policy ensures fast service times of high priority jobs and prevents a single user/application from occupying too many resources and starving others; however, without saving the progress of preempted jobs, this policy causes significant resource waste and delays the response time of long-running or low priority jobs. Our analysis of a publicly available Google cluster trace [25] found that 12% of all scheduled tasks were preempted. If these tasks are simply killed with no checkpointing, the result can be up to a 35% loss in total cluster usage. Similarly, Microsoft reported that about 21% of jobs were killed due to preemptive scheduling in its Dryad cluster [1]. Long-running, low priority jobs are also repeatedly killed and restarted in Facebook's Hadoop cluster [5].

In this paper, we propose an approach that uses system-level, application-transparent suspend-resume mechanisms to implement checkpoint-based preemption¹ and reduce the preemption penalty in cluster scheduling. Instead of killing a job or task, we suspend execution of running processes (tasks) and store their state (e.g., memory content) for resumption at a later time when resources are available. To reduce the preemption overhead and improve performance, our approach leverages fast storage technologies such as non-volatile memory (NVM) and uses a set of adaptive preemption policies and optimization techniques. We implement the proposed approach using the CRIU (Checkpoint/Restore In Userspace) [8] software tool with HDFS and PMFS [12] and integrate our solution into Hadoop YARN [24].

¹ We use suspend-resume and checkpoint-based preemption interchangeably.

The following key contributions differentiate our solution from previous work.

• Using application-transparent checkpointing mechanisms in cluster scheduling. Our method leverages existing application-transparent checkpointing mechanisms and uses them to implement non-killing preemption in cluster scheduling. It can be applied to a wide range of applications without needing to modify the application code. We evaluate the feasibility and applicability of our approach using Google cluster trace-driven simulation and real industry workloads with different configurations and scenarios.

• Adaptive preemption policies and optimization techniques. Application-transparent checkpointing mechanisms are typically expensive because they save the entire state of a running application and dump it to disk, which may trigger a lot of memory, I/O and network traffic. To address these issues, we develop a set of adaptive preemption policies to mitigate these suspend-resume overheads. The adaptive policies dynamically select victim tasks and the appropriate preemption mechanisms (e.g., kill vs. suspend, local vs. remote restore) according to the progress of each task and its suspend-resume overhead. Instead of dumping the entire memory region, memory usage is tracked, and only those memory regions that were changed since the last suspend are saved, reducing the checkpoint size and latency. The adaptive policies enable significant improvement in application performance over a policy that always suspends or kills a job during preemption.

• Leveraging fast storage. Our approach can further reduce the preemption overheads using emerging fast storage technologies such as non-volatile memory (NVM) [17]. By efficiently storing application checkpoints on fast storage, our approach can quickly suspend and resume applications and improve the efficiency of checkpoint-based preemption. In our implementation, we leverage the CRIU software tool [8] to save checkpoints to an emulated NVM-based file system using PMFS (Persistent Memory File System) [12]. Alternatively, we can use NVM as persistent memory (NVRAM) and copy checkpoint data from DRAM to NVM using memory operations. This method exploits NVM's byte-addressability to avoid serialization and uses operating system paging and the processor cache to improve latency. To improve performance, a shadow buffering mechanism can be used to explicitly handle variables between DRAM and NVRAM. For example, updates to DRAM can be incrementally written to NVM. During resumption, an attempt to modify the data would move the data back from NVRAM to DRAM.

• Implementation with Hadoop YARN. We implement the proposed non-killing preemptive scheduling and adaptive preemption policies in Hadoop YARN, the new-generation Hadoop cluster resource manager. In particular, we implement application-transparent checkpointing to suspend and resume preempted applications using CRIU. We extend CRIU to save checkpoints to HDFS so that checkpointed tasks can restart from any node in the cluster. We conduct extensive experiments to evaluate the applicability of our checkpoint-based preemption and compare it with YARN's current kill-based preemption on different storage devices: HDD, SSD and NVM.

We found that our approach can improve overall job response times by 30%, reduce resource wastage by 67% and lower energy consumption by 34% over the current kill-based preemption approach used in modern cluster schedulers. These savings can result in more total jobs being scheduled, less energy consumption and reduced costs in the long term, which ultimately yields more profit.

The rest of the paper is organized as follows. We motivate our work with a detailed study of a data center trace from Google in Section 2. Section 3 presents our suspend-resume-based preemption approach and evaluation results. The optimization policies and techniques are discussed in Section 4. The Hadoop YARN implementation and experimental results are discussed in Section 5. Section 6 reviews related work and Section 7 concludes the paper.

2. REAL-WORLD CLUSTER PREEMPTION

To understand the impact of preemption in cluster scheduling, we analyzed the publicly available cluster workload traces from the Google data center [25]. This trace provides data from 12,500 machines for the month of May 2011. It contains cluster scheduler requests and actions for 672,000 jobs.

A job is composed of one or more tasks. Each task has a scheduling priority level from 0 to 11 and a scheduling class describing latency sensitivity (four latency levels). The trace includes detailed task information such as per-task inter-arrival time, CPU/memory demand and usage over time, priority, latency sensitivity, and event type (e.g., submitted, scheduled, evicted or completed). In total, there are 144 million task events during the 29-day trace.

Our goal is to understand the resource efficiency and performance impact of preemption using the Google cluster traces. Prior analysis [4] has shown that task eviction in the trace (accounting for 93% of evictions) is primarily triggered by priority scheduling in Google's cluster scheduler to handle task congestion or resource contention. For example, when a high priority job arrives and the available cluster resources are not sufficient to meet its demand, active low priority jobs/tasks are evicted to release their resources to the higher priority job. Preempted tasks are automatically resubmitted to the scheduler and may experience multiple evictions before successfully finishing. In our study, we focus on scheduling events in the Google trace, specifically submit, schedule, eviction and finish events. According to the Google trace description, a task is evicted for a variety of reasons: preemption by a higher priority task or job, scheduler over-commitment whereby the actual demand on a machine exceeds its capacity, the machine which the task is running on becoming unusable, or the data on the machine becoming lost. To determine preemption, we use the following criterion proposed in [4]: if a higher priority task is scheduled on the same machine within five seconds after a lower priority task was evicted, then we count the lower priority task as preempted due to preemptive scheduling.
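As an illustration, a minimal sketch of this criterion over the trace's event stream might look as follows (in Java; the TraceEvent fields and the microsecond timestamp unit are illustrative assumptions, not the trace's actual schema):

import java.util.List;

// Sketch of the preemption criterion from [4]: count an eviction as a
// preemption if a higher-priority task is scheduled on the same machine
// within five seconds of the eviction.
class TraceEvent {
    long timeMicros;  // event timestamp (assumed to be in microseconds)
    int machineId;    // machine on which the event occurred
    int priority;     // task priority (0-11)
}

class PreemptionDetector {
    static final long WINDOW_MICROS = 5_000_000L;  // five seconds

    static boolean isPreemption(TraceEvent evict, List<TraceEvent> scheduleEvents) {
        for (TraceEvent s : scheduleEvents) {
            long delta = s.timeMicros - evict.timeMicros;
            if (s.machineId == evict.machineId
                    && s.priority > evict.priority
                    && delta >= 0 && delta <= WINDOW_MICROS) {
                return true;  // a higher-priority task took the freed slot
            }
        }
        return false;
    }
}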

Figure 1a shows the percentage of scheduled tasks that were preempted over time during their execution. The results show that many low priority scheduled tasks were preempted during their execution.


[Figure 1: Preemption in Google Trace. (a) Preemption Rate Timeline: preemption rate [%] over the 29 trace days for low, medium and high priority tasks. (b) Preemption Rate Per Priority: percentage of all preemptions at each priority level 0-11. (c) Preemption Frequency Distribution: distinct tasks [thousands] by number of preemptions (1 through >=10).]

Table 1 summarizes the aggregated number of tasks and preemption rate for each priority category. The results show that an average of 12.4% of scheduled tasks were evicted due to preemptive scheduling in the Google cluster, and 20% of scheduled low priority tasks were preempted. Figure 1b shows that the preemption of low priority tasks (i.e., priorities 0-1) accounts for over 90% of the total preemptions. These tasks average four evictions per task-day, and a 100-task job running at this priority will have one task preempted every fifteen minutes [20]. Additionally, a single task could be scheduled and preempted multiple times, as shown in Figure 1c. More than 43.5% of preempted tasks were preempted more than once, and 17% of these tasks were even preempted ten times or more.

Priority           Num. of Tasks   Percent Preempted
Free (0-1)         28.4M           20.26%
Middle (2-8)       17.3M           0.55%
Production (9-11)  1.7M            1.02%

Table 1: Preempted Tasks with Different Priorities.

Without a proper mechanism to save the progress of preempted tasks, compute resources such as CPU, memory and power will be wasted due to repeated execution of these preempted tasks. Frequent and repetitive preemption causes even more resource wastage. We analyzed the impact of preemption on resource wastage in the Google trace and found that kill-based preemption could result in a huge amount of resource wastage. If we assume that the scheduler simply kills the preempted tasks and there is no mechanism to save the progress of a preempted task, 130k CPU-hours (up to 35% of total usage) could have been wasted during the trace period due to preemptive scheduling. The amount of resources wasted is estimated as the amount of CPU time spent on unsuccessful execution of tasks, i.e., the CPU time between schedule and preempt events.
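In code form, this estimate is just a sum over schedule/preempt intervals (a sketch in Java; the pairing of schedule and preempt events is assumed to be done already):

import java.util.List;

// Wasted CPU time: the time between each schedule event and the matching
// preempt event, summed over all unsuccessful executions.
class WastageEstimator {
    // Each long[] holds {scheduleTimeMicros, preemptTimeMicros} for one execution.
    static double wastedCpuHours(List<long[]> unsuccessfulExecutions) {
        double totalMicros = 0;
        for (long[] e : unsuccessfulExecutions) {
            totalMicros += e[1] - e[0];
        }
        return totalMicros / 1e6 / 3600.0;  // microseconds -> seconds -> hours
    }
}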

Further, although most of the preempted tasks are low priority tasks, we find that latency-sensitive tasks were also preempted. Table 2 summarizes the number of scheduled tasks and the percentage of preempted tasks for each latency sensitivity level. The results show that a large fraction of the most latency-sensitive tasks (14.8%) were still preempted. This can have a significantly negative impact on task performance and application QoS.

Latency Sensitivity   Num. of Tasks   Percent Preempted
0 (lowest)            37.4M           11.76%
1                     5.94M           18.87%
2                     3.70M           8.14%
3 (highest)           0.28M           14.80%

Table 2: Preempted Tasks with Different Latency Sensitivities.

We also found similar issues reported with preemptive scheduling in Facebook's and Microsoft's shared clusters running big data applications [1, 5]. In Facebook's 600-node Hadoop cluster, 3% of its jobs needed map slots that exceeded 50% of the cluster's capacity, and 2% of its jobs had map tasks that exceeded the capacity of the entire cluster. During peak times, a large production job would arrive every 500 seconds and kill all low priority map tasks [5]. During these busy periods, low priority jobs are repeatedly killed, wasting a significant amount of cluster resources. Similarly, Microsoft reported that roughly 21% of jobs were killed due to preemptive scheduling [1].

In summary, our analysis of production workloads shows that kill-based preemption in shared cluster scheduling results in significant resource wastage and performance loss.

3. CHECKPOINT-BASED PREEMPTION

In this paper, we propose the use of an application-transparent suspend-resume mechanism to implement checkpoint-based preemption. This improves current preemption policies and mechanisms in cluster scheduling and reduces resource wastage.

3.1 System Model

We consider a cluster consisting of many nodes running jobs across multiple frameworks, applications and users. Each node has a set of computing resources including CPU, memory, storage, I/O and network bandwidth. Each job consists of multiple tasks that are scheduled to run on nodes by a scheduler based on their resource demand and scheduling policies. Tasks can share resources on nodes and achieve performance isolation via "containers" or "slots".

A cluster scheduler is in charge of scheduling the tasks of submitted jobs and managing task resources. Users submit jobs to a queue in the cluster, and each job has a scheduling priority and a resource requirement (the amount of CPU and memory it needs). In particular, the scheduler assigns a job's tasks to specific nodes for execution. When there are idle resources, the cluster scheduler can give a job's tasks resources in excess of its configured capacity to improve cluster utilization. When a new job arrives and there are no more resources available, the scheduler chooses active jobs that are either of lower priority (priority scheduling) than


the arriving job, or jobs that are using more resources than their fair share (fair-share scheduling) or guaranteed capacity (capacity scheduling). The tasks of the selected jobs are then preempted to release their occupied resources. Multiple scheduling policies, such as priority, fair-sharing and capacity scheduling, can be employed. To simplify the discussion, but without loss of generality, we assume priority scheduling in the rest of the paper.

The model described above is generic and is employed by many frameworks such as Google's Omega [21], Hadoop YARN [24], Mesos [14] and Dryad [15].

3.2 Checkpoint-based Preemption

Most cluster schedulers preempt a job or task by simply killing it. Alternatively, we propose to save the progress of a preempted task by suspending or checkpointing its state and resuming it later when resources are available.

3.2.1 Application-transparent Suspend-Resume

While application-specific checkpointing mechanisms have been proposed in prior work such as [1, 6], we focus on the use of application-transparent suspend-resume mechanisms such as CRIU (Checkpoint/Restore in Userspace) and OS checkpoint mechanisms (e.g., SIGSTOP/SIGTSTP/SIGCONT). These mechanisms suspend and checkpoint a running application as a collection of files. The suspended application can then be resumed at any time and return to the point at which it was suspended. Typically, suspending an application involves collecting and dumping the entire namespace information to files on disk, including kernel objects, the process tree (gathered via ptrace, /proc, netlink and syscalls), signals, CPU register sets, and memory content. To restore a suspended process, the process tree is rebuilt from the saved information, pipes are restored and the memory mapping is recreated.

We implement suspend-resume-based preemption using CRIU [8]. CRIU is an open-source Linux software tool that supports checkpointing and restoring processes on x86_64 and ARM and works on unmodified Linux 3.11+ kernels as shipped in Debian, Fedora, Ubuntu, etc. It has been tested with many applications, including Java, Apache, MySQL and Oracle DB, and is integrated with LXC/Docker/OpenVZ containers.

Our cluster scheduler uses CRIU to suspend a preempted task and adds it back to the submission queue. The resubmitted task includes information about the task's current progress, checkpoint location, etc. When a suspended task is scheduled, the scheduler runs a CRIU restore and resumes the task from the saved state.
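For concreteness, the following sketch shows how a node-side agent could drive CRIU's dump and restore commands (here via Java's ProcessBuilder; the paper does not show its exact invocation, and the --shell-job flag and image directory layout are assumptions):

import java.io.IOException;

// Minimal sketch of invoking CRIU from a scheduler-side agent, assuming the
// criu binary is on the PATH and the image directory already exists.
class CriuRunner {
    static void suspend(int pid, String imageDir) throws IOException, InterruptedException {
        // "criu dump" checkpoints the process tree rooted at pid into imageDir
        // and, by default, kills the dumped tasks, releasing their resources.
        run("criu", "dump", "-t", Integer.toString(pid), "-D", imageDir, "--shell-job");
    }

    static void resume(String imageDir) throws IOException, InterruptedException {
        // "criu restore" rebuilds the process tree from the saved images.
        run("criu", "restore", "-D", imageDir, "--shell-job");
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + String.join(" ", cmd));
        }
    }
}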

3.2.2 Distributed Suspend-Resume

CRIU supports checkpoints only on the local file system, due primarily to potential name conflicts on a remote node. We enhance CRIU to save checkpoints on distributed file systems. In particular, we extend CRIU to work with HDFS to support remote suspend-resume. This enables more flexible scheduling by resuming a suspended task on any available node. We achieve this by leveraging libhdfs. Instead of dumping checkpointed data to local file buffers, we perform a write and flush to a system-specified directory in HDFS. Similarly, during restore, CRIU reads the contents of checkpointed data from HDFS instead of the local file system. Additionally, some process information (e.g., linked files) that is normally checkpointed is modified to make resumption possible on a remote node. This way, remote resumption is

completely handled by HDFS, without worrying about the migration and replication of checkpointed data.
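The CRIU extension itself is C code built against libhdfs; purely as an illustration of the write-and-flush step, the equivalent flow in terms of HDFS's Java FileSystem API is sketched below (the checkpoint directory and file names are hypothetical):

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a checkpoint image to a system-specified HDFS directory instead of a
// local file buffer, so that any node in the cluster can later read it back.
class HdfsCheckpointStore {
    static void writeImage(byte[] imageData, String taskId) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dst = new Path("/checkpoints/" + taskId + "/pages.img");  // hypothetical layout
        try (OutputStream out = fs.create(dst, true /* overwrite */)) {
            out.write(imageData);  // dump the image bytes
            out.flush();           // flush so the data is durable before resources are released
        }
    }
}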

3.2.3 Suspend-Resume with NVM

Checkpointing a task can cause overhead, especially if the checkpoint is written to slow HDD devices. To reduce the overhead, we leverage fast storage technologies such as SSD and emerging byte-addressable non-volatile memory (NVM) technologies [17]. By efficiently storing application checkpoints on faster storage devices, we can implement fast mechanisms to suspend and resume applications at runtime.

NVM can be used as a fast disk with file system interfaces or as virtual memory. Accordingly, there are two ways to save checkpoints in NVM. The first is to use NVM as a fast disk and save the checkpoints (images) in an NVM-based file system such as Intel PMFS (Persistent Memory File System) [12] or BPFS [7]. PMFS is a lightweight kernel-level file system that provides byte-addressable, persistent memory to applications via CPU load/store instructions. PMFS achieves low overhead through a variety of techniques; for instance, it avoids the block device layer by using byte-addressability and mapping persistent memory pages directly into an application's memory space. We leverage PMFS in our prototype and evaluations to emulate an NVM-based file system. To support suspend/resume in distributed environments, we use a local PMFS-mounted directory as the HDFS data storage. To use PMFS with HDFS, we pre-allocate a contiguous area of DRAM before the OS boots for use as the file system space. Then, we mount PMFS by pointing it at the memory address of the starting region and specifying the total size of the file system. The PMFS-mounted directory can then be used by HDFS. In our prototype, CRIU saves the checkpoints via the HDFS interface; HDFS, in turn, stores them in PMFS across multiple nodes.
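As a rough sketch of this setup (the reserved region, sizes and paths are illustrative, and the memmap/physaddr/init parameters follow PMFS's documented usage rather than the paper):

# Reserve a contiguous DRAM region at boot (kernel command line):
#   memmap=48G$64G        reserve 48 GB of memory starting at the 64 GB mark
# Mount PMFS over the reserved region:
mount -t pmfs -o physaddr=0x1000000000,init=48G none /mnt/pmfs
# Point an HDFS DataNode at the PMFS-backed directory (hdfs-site.xml):
#   dfs.datanode.data.dir = /mnt/pmfs/hdfs-data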

Alternatively, we can use NVM as virtual memory (i.e., NVRAM). This method exploits NVM's byte-addressability to avoid serialization and uses OS paging and the processor cache to improve latency. In this case, checkpointed data is copied from DRAM to NVM using memory operations. To improve performance, a shadow buffering mechanism can be used to explicitly handle variables between DRAM and NVRAM [16]. Updates to DRAM can be incrementally written to NVM. During resumption, an attempt to modify the data would move the data back from NVRAM to DRAM. Our current prototype has not yet integrated the mechanisms for using NVM as virtual memory for checkpointing; this is a topic for our future work.

3.3 Evaluation

3.3.1 Suspend-Resume Overhead

The overhead of suspend-resume is mainly determined by the storage media performance (i.e., I/O bandwidth) and the application's memory size. We run experiments to evaluate the overhead of our application-transparent suspend-resume mechanism on different storage media. We suspend and resume a program which allocates and fills a specified amount of memory and performs a simple computation. We vary the program's memory size and measure the time needed to suspend and resume the program on different storage media: HDD, SSD and NVM (PMFS, in this case). Our experiment machine has two Xeon 5650 CPUs, 96GB RAM, a 500GB HDD and a 120GB SSD (OCZ Deneva 2).


[Figure 2: Suspend and Restore Performance on Local FS and HDFS. (a) Local File System and (b) HDFS: total dump/restore time [s] vs. checkpoint size [GB] (0-10 GB) for HDD, SSD and NVM (PMFS).]

[Figure 3: Google Trace-driven Simulation: Comparison of Different Preemption Policies. (a) Resource Wastage: wasted CPU capacity [core-hours] for Kill, Chk-HDD, Chk-SSD and Chk-NVM. (b) Energy Consumption: power [kWh] per preemption method. (c) Performance: normalized response time for low, medium and high priority jobs.]

The results on the local file system are shown in Figure 2a. The time to suspend and resume the program is linearly correlated with the program's memory footprint. The SSD is approximately 3-4x faster than the HDD, and NVM is 10-15x faster than SSD.

The results on HDFS are shown in Figure 2b. Similar to the local file system, the suspend and restore time is mostly linearly correlated with the memory size, but it takes longer than on the local file system due to the overhead added by HDFS. On the other hand, compared with suspend/resume on a local file system, suspend/resume with HDFS enables a suspended task to start on any node. Hence, it enables the scheduler to schedule the task earlier and may actually reduce the overall response time.

These results show that the suspend-resume overhead varies significantly depending on the job size and storage performance. The overhead can be high for jobs with large memory footprints (e.g., memory-intensive applications) or on slow storage such as HDD. The benefit of suspend-resume-based preemption will therefore depend on the I/O performance and workload characteristics. This raises the question: is the proposed suspend-resume-based preemption actually beneficial for real workloads and feasible in practice? To answer this, we conduct experiments via Google cluster trace-driven simulation and with real applications.

3.3.2 Google Trace-driven Simulation

We develop a trace-driven cluster scheduling simulator. It follows the system model detailed in Section 3.1 and implements different scheduling and preemption policies. We use one day of job trace data from the Google cluster trace in our simulation. The one-day trace contains approximately 15,000 jobs (totaling over 600,000 tasks) requiring over 22,000 cores. The jobs are split into three priority levels, and

preemption decisions made by the scheduler are based on each job's priority level. The system performance parameters on different storage media, such as I/O bandwidth and checkpoint overhead, are populated with the measurements obtained in Section 3.3.1.

We evaluate four policies. The kill-based policy kills lower priority jobs during preemption. The other three policies checkpoint preempted tasks by saving the tasks' states to different storage media (HDD, SSD and NVM) and resume them later when resources are available. Figure 3 shows the resource wastage (e.g., the amount of CPU time wasted from repeatedly killed jobs and from preemption and checkpoint overhead), the energy consumption and the job performance (job response time normalized to that of kill-based preemption) under the four policies. A job's response time is defined as the total time the job spent queueing plus the actual job execution time.

Kill-based preemption, which is used by most cluster schedulers, wastes about 3,400 CPU-core-hours (about 35% of the total capacity) by killing low priority jobs to reclaim resources for higher priority jobs. Compared to kill-based preemption, checkpoint-based preemption reduces the resource wastage to 14.6%, 11.1% and 8.5% on HDD, SSD and NVM, respectively. This reduced resource wastage implies that more jobs can be scheduled in the same time period, leading to cost savings.

Energy consumption was calculated by taking the average CPU utilization of each machine, converting it to a corresponding wattage and multiplying it by the total experiment time. Based on this calculation, checkpoint-based preemption on HDD and SSD is similar to kill-based preemption, but the checkpoint-based approach on NVM reduces the energy consumption by about 5%.
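In symbols (our shorthand for the calculation above): E = \sum_m P(\bar{u}_m) \times T, where \bar{u}_m is machine m's average CPU utilization, P(.) maps utilization to wattage, and T is the total experiment time.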

As far as performance is concerned, checkpoint-based


[Figure 4: Comparison of Different Policies with Varying I/O Bandwidths. (a) High Priority Job Performance and (b) Low Priority Job Performance: normalized response time vs. checkpoint bandwidth [GB/s] (1.0-5.0) for Wait, Kill and Checkpoint. (c) Energy Consumption: normalized energy consumption vs. checkpoint bandwidth [GB/s].]

preemption using HDD gives low priority jobs better performance than preempt-kill, but performance for medium and high priority jobs is worse due to the substantial checkpointing overhead. Checkpointing on SSD offers high priority jobs performance comparable to the preempt-kill policy and gives low priority jobs better performance; the performance of medium priority jobs is slightly worse than with kill-based preemption. If we use an NVM-backed file system, the response times of both low and medium priority jobs are reduced significantly (by 74% and 23%, respectively), while achieving similar performance for high priority jobs.

In summary, checkpoint-based preemption can significantly reduce resource wastage even with slow storage like HDD, although there is a performance penalty for medium and high priority jobs. As we use faster storage such as SSD, the penalty becomes much smaller. With fast NVM, checkpoint-based preemption can reduce resource wastage and energy consumption and improve the performance of low and medium priority jobs, while achieving comparable performance for high priority jobs; however, there is a non-negligible performance penalty for higher priority jobs when checkpoint-based preemption uses slow storage. To further understand the effectiveness and feasibility of application-transparent, checkpoint-based preemption and the impact of storage performance, we conduct the following sensitivity analysis.

3.3.3 Sensitivity Analysis with Real Applications

The experiment involves two jobs, each running a simple k-means program [9] with a one-minute execution time and a 5 GB memory size. The two jobs run on a real machine under the following scenario: a low priority job executes for 30s before a high priority job arrives and preempts it. We compare three preemption policies under different I/O bandwidths. In the first policy, wait, the high priority job waits for the low priority job to finish before executing. In the second policy, kill, the low priority job is immediately killed in favor of the high priority job and restarts its execution from scratch when the high priority job has finished. In the third policy, preempt-checkpoint, the low priority job is suspended by saving its progress, and the high priority job starts executing after the checkpointing has finished. Once the high priority job completes, the low priority job is restored from the state at which it was checkpointed and continues execution. Varying the I/O bandwidth is accomplished by saving checkpoints in PMFS and changing the value of the thermal control register available in Intel Xeon E5-2650 CPUs, which throttles the memory bandwidth to emulate different I/O performance.

Figures 4a and 4b show the normalized performance results for the high priority and low priority jobs under each of the three policies with varying storage media bandwidth. For the high priority job, killing the low priority job always yields the best performance, while waiting for the low priority job to finish increases its response time by more than one-half. When the I/O bandwidth is low, checkpointing the low priority job actually yields a worse response time than killing it and restarting from scratch. As the I/O bandwidth increases, checkpoint-based preemption yields better performance, and the response times become comparable to the kill-based policy when the storage bandwidth is very high, e.g., using NVM. We also measure the energy consumption based on the total response time of both jobs, as shown in Figure 4c. The wait policy yields the best energy consumption since no CPU cycles are wasted, while the kill policy wastes CPU resources and consumes more energy. Checkpoint-based preemption results in higher energy consumption on slow storage than the kill policy.

These results confirm our observations from Section 3.3.1 that the effectiveness of checkpoint-based preemption depends on the storage performance and job properties, and that checkpointing may not always be beneficial. When the checkpointing overhead is low (e.g., with fast storage or a small job memory footprint), checkpoint-based preemption can improve performance and energy efficiency; however, when the checkpointing overhead is high (e.g., when checkpointing large jobs on slow storage), the overhead may outweigh the benefit and make checkpoint-based preemption worse than the simple kill or wait policies. This observation motivates the idea of an adaptive preemption policy, which dynamically chooses an appropriate preemption mechanism based on the checkpointing overhead. We discuss optimizations to the basic checkpoint-based preemption in Section 4.

4. OPTIMIZATION

4.1 Adaptive Policies and Algorithms

As discussed in Section 3.3.3, the challenge of using application-transparent checkpointing mechanisms is that they can be expensive with slow storage and large jobs, because such mechanisms typically collect the entire state of running processes, including memory content, and dump it to a storage device. Dumping a task's full state may trigger a lot of memory and I/O traffic (and possibly network traffic if checkpointing for remote resumption) and delay the relinquishment of


resources to high priority and critical workloads. Further, it can degrade other active tenant applications during checkpointing. Naive use of such methods to suspend and resume applications in cluster scheduling with slow storage devices can be detrimental to some jobs' performance (e.g., high priority, production jobs).

To address these issues, we propose a set of adaptive policies to minimize the preemption penalty. These policies improve application performance in cluster scheduling by choosing proper victim tasks and preemption mechanisms based on storage media performance (i.e., I/O bandwidth), workload progress and checkpoint/restore overhead. We also use optimization techniques such as incremental checkpointing to reduce the overhead.

1. Adaptive preemption dynamically selects victim tasks and preemption mechanisms (checkpoint or kill) based on the progress of each task and its checkpoint/restore overhead. Specifically, the total checkpointing overhead is estimated as the sum of the time to checkpoint and restore a task plus the queueing time to checkpoint. The time to checkpoint or restore a task is estimated from the checkpoint size and the I/O bandwidth (size/bandwidth). If other checkpoint operations are occurring on the machine, the queueing time is how long the task must wait for those operations to finish before it can dump its own state to storage. This total overhead is compared with the current progress of the task. If the progress exceeds the total checkpointing overhead, the task is checkpointed; otherwise, the task is simply killed. The pseudo-code for our preemption algorithm is shown in Algorithm 1.

Algorithm 1: Preemption Algorithm

overhead_chkpt = size/bw_write + size/bw_read + queue_time_dump

candidate_victims = get_candidate_victims();
sort(candidate_victims);
for Task t in candidate_victims do
    if t.progress > t.checkpoint_overhead then
        if t.previous_checkpoint != null then
            do_incremental_checkpoint(t);
        else
            do_normal_checkpoint(t);
        end
    else
        kill(t);
    end
end

2. Adaptive resumption restores preempted jobs/tasks when resources are available, according to their resumption overheads, which are calculated from the checkpoint size, the available network and I/O bandwidth, etc. We use HDFS to store checkpoints, and hence a preempted task can be scheduled on a local or remote node. It may seem that the local restore overhead will always be lower than the overhead of a remote restore, but there can be extra costs for a local restore, depending on whether the restoring task needs to preempt other running tasks or must wait in the preemption queue for other checkpoint/restore operations to complete. The pseudo-code for our resumption algorithm is shown in Algorithm 2.

Algorithm 2: Resumption Algorithm

overhead_local = size/bw_read + queue_time_local
overhead_remote = size/bw_net + size/bw_read + queue_time_remote

preempted_tasks = get_preempted_tasks();
for Task t in preempted_tasks do
    if t.previous_checkpoint == null then
        restart_task(t);
    else
        if t.local_resume_overhead <= t.remote_resume_overhead then
            do_local_resume(t);
        else
            do_remote_resume(t);
        end
    end
end

3. Incremental checkpointing is used to checkpoint only modified memory regions. A task may be suspended multiple times; for subsequent preemptions after the first checkpoint, we only need to checkpoint the task's memory

regions that have been modified since the last checkpoint. This can significantly reduce the checkpoint size and latency, especially for read-dominant workloads. CRIU supports such incremental checkpoints with memory change tracking by leveraging soft-dirty bits in the page table. A soft-dirty bit tracks which pages a task has written to. When incremental checkpointing is first enabled for a task, CRIU clears the soft-dirty and writable bits in the task's page table entries. Subsequently, if the task writes to one of its pages, a page fault occurs and the kernel sets the soft-dirty bit in the corresponding page table entry. If the task needs to be dumped again after its initial checkpoint, only the pages whose soft-dirty bit is set need to be dumped. Table 3 shows the results of checkpointing a program with 5 GB of memory twice, where 10% of the memory region is modified between the first checkpoint and the second. As we can see, the second checkpoint operation is an order of magnitude faster than a full dump on all three storage media. Our preemption mechanism utilizes incremental checkpointing whenever possible to reduce the overhead. Similarly, depending on the amount of resources that need to be released, either the entire task memory partition or only a portion of it needs to be checkpointed. For example, to reclaim resources for a CPU-intensive job, we only need to suspend the running job and dump a portion of its memory region.

Storage   First Checkpoint   Second Checkpoint
HDD       169.18s            15.34s
SSD       43.73s             4.08s
PMFS      2.92s              0.28s

Table 3: Benefits of incremental checkpointing.
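To sketch how such incremental dumps are driven (using CRIU's documented --track-mem and --prev-images-dir options; the scheduler-side wiring shown here is an illustrative assumption):

import java.io.IOException;

// First dump: enable soft-dirty tracking and write a full image set.
// Later dumps: write only pages modified since the previous image set.
class IncrementalCheckpoint {
    static void firstDump(int pid, String imageDir)
            throws IOException, InterruptedException {
        run("criu", "dump", "-t", Integer.toString(pid), "-D", imageDir,
            "--track-mem", "--shell-job");
    }

    static void nextDump(int pid, String imageDir, String prevImageDir)
            throws IOException, InterruptedException {
        // --prev-images-dir makes this dump relative to the earlier images,
        // so pages whose soft-dirty bit is clear are not rewritten
        run("criu", "dump", "-t", Integer.toString(pid), "-D", imageDir,
            "--prev-images-dir", prevImageDir, "--track-mem", "--shell-job");
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) throw new IOException("criu failed: " + String.join(" ", cmd));
    }
}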

4.2 Benefits of Adaptive Policies

4.2.1 Google Trace-driven Simulation

We integrate the adaptive policies into the trace-driven simulator described in Section 3.3.2 and evaluate them using the same one-day job trace from the Google cluster traces as in Section 3.3.2. Figure 5 shows the performance (response time normalized to the basic policy) of adaptive preemption versus basic checkpoint-based preemption, which always checkpoints a preempted job. The results show that the adaptive policy is very effective and improves the


[Figure 5: Performance Improvement with Adaptive Policies. Normalized response time of low, medium and high priority jobs for the Basic and Adaptive policies on (a) HDD, (b) SSD and (c) NVM.]

performance for all three types of jobs, particularly on slower storage like HDD and SSD. The response times of low priority jobs on HDD, SSD and NVM are reduced by 36%, 12% and 3%, respectively. The response times of medium priority jobs are reduced by 55%, 17% and 8% on HDD, SSD and NVM, respectively. The adaptive policies also improve high priority job performance on HDD and SSD by 29% and 8%, respectively. High priority job performance on NVM is comparable to that of the kill-based policy, which is the best possible performance for high priority jobs.

Our experimental results show that the adaptive approach also reduces energy consumption on all three storage media compared to basic checkpoint-based preemption. We omit this graph due to space constraints.

4.2.2 Sensitivity Analysis with Real Applications

We further evaluate and compare the different policies under varying I/O bandwidths using real applications. The experiment setup and scenario are the same as described in Section 3.3.3.

Figures 6a and 6b show the performance results for the high priority and low priority jobs under each of the four policies (wait, kill, always-checkpoint, adaptive) while varying the checkpointing bandwidth. As discussed in Section 3.3.3, the basic policy that always checkpoints a job is not beneficial at low bandwidths and results in performance even worse than just killing the job. The adaptive policy chooses to kill the low priority job at low checkpointing bandwidths, but chooses to checkpoint it when the checkpointing bandwidth is higher. As a result, the performance of the high priority job is never worse than under the wait approach. As the available I/O bandwidth increases, the performance approaches that of the kill-based policy. Similarly, the adaptive policy achieves better performance than the basic always-checkpoint policy at low bandwidths and comparable performance to the wait policy at high bandwidths.

The energy consumption results are shown in Figure 6c. The basic checkpoint-based preemption policy can result in higher energy consumption at lower bandwidths than the kill policy. By contrast, the energy consumption of the adaptive policy is never worse than the kill policy and is similar to the wait policy at higher bandwidths.

5. HADOOP YARN IMPLEMENTATION

We have integrated the proposed checkpoint-based preemptive scheduling and optimization policies into Hadoop YARN. We describe the details of the implementation below

and also compare our system with YARN's current kill-based preemption for the DistributedShell application on different storage devices: HDD, SSD and NVM.

5.1 Overview of Hadoop YARN

YARN is the next-generation cluster resource manager for the Hadoop platform. It allows multiple data processing frameworks, such as MapReduce, Spark [26], Storm and HBase, to dynamically share resources and data in a single shared cluster. YARN uses a global resource scheduler (the YARN ResourceManager, or RM) to arbitrate resources (CPU, memory, etc.) among application frameworks based on configured per-framework resource capacities and scheduling constraints. A per-application YARN ApplicationMaster (AM) requests resources from the RM and chooses what tasks to run. It is also responsible for monitoring and scheduling tasks within an application.

The YARN ResourceManager supports capacity scheduling and fair scheduling. The scheduler allocates resources in the form of containers to applications based on capacity constraints, queues and priorities. Like other popular cluster schedulers, the YARN scheduler relies on preemption to coordinate resource sharing, guarantee QoS and enforce fairness, as follows. When a new job or container request arrives and there is resource contention, the YARN ResourceManager determines what is needed to restore capacity balance and selects victim application containers according to predefined policies (e.g., capacity sharing or priority scheduling). The ResourceManager then sends a request to those containers' ApplicationMasters to terminate the containers gracefully and, as a last resort, sends a request to the containers' NodeManagers to terminate them forcefully.

5.2 Architecture and Implementation

5.2.1 Checkpoint-based Preemption

Figure 7 shows the software architecture of our checkpoint-based preemption implementation on YARN. Preemption and checkpointing occur in YARN in the following manner. (1) A new job or ApplicationMaster requests resources from the ResourceManager. (2) When there is resource contention, the ResourceManager requests an ApplicationMaster to terminate its application container(s), so that resources can be returned and given to an application with higher priority, by dispatching a ContainerPreemptEvent. The ContainerPreemptEvent specifies a particular ApplicationMaster and the containers to preempt. By default, the AM does not handle this event, so a container managed by the AM will be forcefully killed by the NodeManager after a certain


[Figure 6: Comparison of Different Policies with Varying I/O Bandwidths. (a) High Priority Job Performance and (b) Low Priority Job Performance: normalized response time vs. checkpoint bandwidth [GB/s] (1.0-5.0) for Wait, Kill, Checkpoint and Adaptive. (c) Energy Consumption: normalized energy consumption vs. checkpoint bandwidth [GB/s].]

[Figure 7: YARN Architecture. The diagram shows a YARN ApplicationMaster with an Application Preemption Manager interacting with the YARN ResourceManager (YARN cluster scheduler) and YARN NodeManagers, where CRIU dumps and restores tasks to and from HDFS backed by HDD, SSD or NVM (PMFS). Labeled steps: 1. New Job; 2. Preemption Request; 3. Suspend; 4. Suspend Complete; 5. Container Request; 6. Resume.]

timeout. (3) We implemented a new preemption manager for the AM (in our current implementation we modify the DistributedShell ApplicationMaster) to handle the ContainerPreemptEvent, so that when such an event arrives, the preemption manager can make a preemption decision based on the specified preemption policy (discussed in the next subsection). For example, instead of killing the container, the AM can suspend the task running in the container using the CRIU dump command and save the state of the container to the Hadoop Distributed File System (HDFS). (4) Once the checkpoint data has been successfully saved to HDFS, the resources of the checkpointed task can be reclaimed by the RM; the ApplicationMaster notifies the RM of the newly available resources. (5) The ApplicationMaster also submits a new request to the RM to allocate a new container for the checkpointed task when resources become available. (6) Once resources are available, the RM allocates a new container to the ApplicationMaster, and the AM issues a command to restore the saved checkpoint from HDFS and resume computation from the saved state.

In our prototype, we validated the above steps by implementing them for the DistributedShell ApplicationMaster, which is included by default in the YARN distribution. A new component, the Preemption Manager, is added to the DistributedShell ApplicationMaster to support checkpointing during preemption. DistributedShell runs a shell command (or any program) on a set of containers in a distributed and parallel manner. The DistributedShell AM first requests a set of resources for containers from the RM and specifies a priority level for the request. Once the resource request is granted, it starts running the command in the containers. The DistributedShell AM also monitors each container and has the functionality to re-run a container if

In our prototype, we validated the above steps by im-plementing it for the DistributedShell ApplicationMaster,which is included by default in the YARN distribution. Anew component, the Preemption Manager, is added to theDistributedShell ApplicationMaster that supports checkpoint-ing during preemption. The DistributedShell runs a shellcommand (or any program) on a set of containers in a dis-tributed and parallel manner. The DistributedShell AM firstrequests a set of resources for containers from the RM andspecifies a priority level for the request. Once the resourcerequest is granted, it will start running the command onthe container. The DistributedShell AM also monitors eachcontainer and has the functionality to re-run a container if

it has failed or has been killed. Once each container has finished running the command, the AM finishes and returns the resources to the RM. In our scenario, in case of resource insufficiency, the DistributedShell AM checkpoints existing containers and frees up resources. On restore, instead of issuing a new shell command, the checkpointed state is retrieved and computation resumes.
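The decision logic inside the Preemption Manager can be summarized by the following self-contained Java sketch; the helper names are hypothetical stand-ins, and the logic simply mirrors steps (2)-(6) above together with the adaptive rule of Algorithm 1:

// Schematic sketch of the Preemption Manager's decision path (names are
// illustrative, not the actual implementation).
class PreemptionManagerSketch {
    // Estimated seconds to dump and restore a container:
    // size/bw_write + size/bw_read + queueing delay (cf. Algorithm 1).
    static double estimateOverheadSeconds(long checkpointBytes, double bwWriteBytesPerSec,
                                          double bwReadBytesPerSec, double queueSeconds) {
        return checkpointBytes / bwWriteBytesPerSec
             + checkpointBytes / bwReadBytesPerSec
             + queueSeconds;
    }

    // Invoked when a ContainerPreemptEvent names one of this AM's containers.
    static void onPreemptEvent(double progressSeconds, double overheadSeconds,
                               Runnable checkpointToHdfs, Runnable killContainer) {
        if (progressSeconds > overheadSeconds) {
            checkpointToHdfs.run();  // suspend via CRIU and save the image to HDFS
        } else {
            killContainer.run();     // rerunning from scratch is cheaper
        }
    }
}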

5.2.2 Adaptive Policies Implementation

We implemented the adaptive checkpoint-based preemption and resumption algorithms described in Section 4:

• Checkpoint cost-aware eviction. Cost-aware eviction is implemented in the ResourceManager. The RM calculates the checkpointing time for each candidate victim container by dividing the memory size of the container by the checkpointing bandwidth available on its node. The ResourceManager then selects the containers with the lowest ratios and sends a ContainerPreemptEvent to their ApplicationMasters so that those containers are checkpointed (see the sketch after this list).

• Adaptive preemption. When an ApplicationMaster receives a ContainerPreemptEvent, it calculates the estimated checkpoint dump and restore time. If this time is greater than the current progress of the task in the container, the ApplicationMaster simply issues a kill command to the container instead of checkpointing it. After the container is successfully killed, the ApplicationMaster requests resources from the RM for a new container to re-run the killed task.

• Incremental checkpointing with memory trackers. We implement this by enabling CRIU to track the soft-dirty bits of tasks that have been resumed from checkpointed data. If any of these tasks are preempted again, only the regions that have been modified need to be checkpointed again.

• Cost-aware remote resumption. Our implementation supports both local and remote resumption. A checkpointed task can specify a preference for local resume, remote resume, or no preference. If there is no preference, when there are enough resources to run the checkpointed task, the ResourceManager chooses an available node and the missing blocks of checkpointed data are sent to the new node before the task is restored.

• Our implementation uses sequential checkpoint/restore to limit the number of concurrent checkpoints on each node and minimize interference. The RM maintains a list of checkpoint queues, one per node. When the RM sends a ContainerPreemptEvent to an AM, it adds the preempted containers to their nodes' checkpoint queues. When the RM acquires the resources from preempted containers, it removes those containers from their respective queues. When calculating the checkpointing overhead, the RM takes into account how many containers are in each node's checkpointing queue.
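The two cost-based policies above boil down to a few comparisons. The sketch below is our own illustrative rendering; the class, field names, and units are assumptions, not the prototype's actual code.

import java.util.Comparator;
import java.util.List;

// Illustrative rendering of the cost-aware eviction and adaptive
// preemption policies; all names and units here are assumptions.
public class AdaptivePolicySketch {

    public static class Candidate {
        long memoryBytes;       // memory footprint of the container
        double dumpBandwidth;   // checkpoint bandwidth on its node, bytes/s
        double progressSeconds; // time the task has already been running
    }

    // Cost-aware eviction (RM side): estimated checkpoint time is memory
    // size over available bandwidth; evict the cheapest candidates first.
    public static void sortByCheckpointCost(List<Candidate> victims) {
        victims.sort(Comparator.comparingDouble(
                c -> c.memoryBytes / c.dumpBandwidth));
    }

    // Adaptive preemption (AM side): if dump plus restore would take longer
    // than redoing the work done so far, kill and re-run; else checkpoint.
    public static boolean shouldCheckpoint(Candidate c,
                                           double restoreBandwidth) {
        double dumpSeconds = c.memoryBytes / c.dumpBandwidth;
        double restoreSeconds = c.memoryBytes / restoreBandwidth;
        return dumpSeconds + restoreSeconds < c.progressSeconds;
    }
}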



[Figure 8 plots: (a) Resource Wastage, CPU wastage in core-hours (0-200); (b) Energy Consumption in kWh (0-10); (c) Response Time in minutes (0-16) for low and high priority jobs. Preemption methods compared: Kill, Chk-HDD, Chk-SSD, Chk-NVM.]

Figure 8: Comparison of Different Preemption Policies on YARN.


5.3 Evaluation

5.3.1 Kill-based vs. Checkpoint-based Preemption

[Figure 9 plot: CDF of job response time, 0-30 minutes, comparing Kill, Chk-HDD, Chk-SSD, and Chk-NVM.]

Figure 9: YARN Workload Job Performance CDF.

We evaluated and compared our checkpoint-based preemption with Hadoop YARN's current kill-based preemption on three different storage devices: HDD, SSD, and NVM, in an eight-node Hadoop cluster (node specifications are described in Section 3.3.1). Each node supports 24 concurrent containers, each with 1 CPU core and 2 GB of memory, and has 48 GB of NVM. We used a workload derived from a Facebook trace [6] which contains 40 jobs (requiring 7,000 tasks). The jobs are split into either low priority or high priority. These two types of jobs are co-located and dynamically share the resources in the YARN cluster via DistributedShell. Each task runs a k-means machine learning program [9] with a maximum memory footprint of approximately 1.8 GB.

Figure 8 shows the total resource wastage in terms of CPU time, the total energy consumption, and the average job response time (i.e., the elapsed time between submission and completion). The current YARN scheduler wastes about 28% of the total capacity in terms of CPU time by killing low priority jobs to reclaim resources for high priority jobs. Compared to kill-based preemption, our approach reduces the resource wastage by 50% and 65% on HDD and SSD, respectively. This reduced resource wastage may lead to more jobs being scheduled and increased energy savings in the long run. In particular, our approach reduces the energy consumption by 21% and 29% on HDD and SSD, respectively. If we use an NVM-based file system (PMFS in this case), the reductions in resource wastage and energy consumption reach 67% and 34%, respectively.

The response time CDF in Figure 9 shows that overall job performance is improved with checkpoint-based preemption over the kill-based approach, and that using NVM achieves the best performance. In terms of average performance, checkpoint-based preemption reduces the average response time of low priority jobs by 18% and 53% on HDD and SSD, respectively; however, the performance of high priority jobs with checkpointing on HDD and SSD is worse than with the kill-based approach. With fast checkpointing on NVM, the response time of low priority jobs is reduced by 61% while the performance of high priority jobs is comparable to kill-based preemption.

5.3.2 Benefits of Adaptive Preemption

We ran another experiment to compare the basic checkpoint-based preemption, which always checkpoints a job, with our adaptive preemption, which leverages our optimized policies. The average response times are shown in Figure 10. Adaptive preemption reduces the response times of low priority jobs by 28%, 16% and 20% over the basic checkpoint-based preemption on HDD, SSD and NVM, respectively. The performance improvement for high priority jobs is 7%, 8% and 14%. With this improvement, checkpoint-based preemption with NVM achieves performance for high priority jobs similar to kill-based preemption while significantly improving low priority job performance and reducing resource and energy usage. Figure 11 shows the response time CDFs of adaptive preemption and basic checkpoint-based preemption. Adaptive preemption improves overall job performance on all three storage media over the basic checkpoint-based preemption.

We also conducted a sensitivity analysis with our YARN implementation, similar to Section 3.3.3, and obtained similar results. The adaptive policy is never worse than the basic policy and achieves optimal performance and resource efficiency with fast storage such as NVM. These results demonstrate that the adaptive policy is a useful technique for improving checkpoint-based preemption.

5.3.3 Overhead of Checkpoint-based Preemption



[Figure 10 plots: average response time in minutes for low and high priority jobs under Basic vs. Adaptive preemption on (a) HDD, (b) SSD, and (c) NVM.]

Figure 10: Performance Comparison of Basic Checkpoint-based Preemption vs. Adaptive Preemption.

[Figure 11 plots: response time CDFs (0-30 minutes) for Kill, Basic, and Adaptive preemption on (a) HDD, (b) SSD, and (c) NVM.]

Figure 11: Response Time CDF of Basic Checkpoint-based Preemption vs. Adaptive Preemption.

We evaluated the cost of checkpoint-based preemption in terms of CPU, storage and I/O overhead; the results are shown in Figure 12. The CPU overhead of preemption, measured as the percentage of CPU time spent on checkpointing and restoring preempted tasks, is shown in Figure 12a. Basic checkpointing incurs a 17% CPU overhead when used with HDD, while the CPU overheads of checkpointing on SSD and NVM are 4% and 0.4%, respectively. With adaptive checkpointing, the overhead of checkpointing to HDD and SSD drops to 5.1% and 2.3%, respectively. Overall, the CPU overhead is acceptable, and with adaptive preemption on NVM the CPU cost is negligible.

We use a worst-case scenario to estimate the I/O overhead of checkpointing: we assume that while a task is being checkpointed, the checkpointing medium's entire bandwidth is used. Under this estimate, the average bandwidth usage of basic checkpointing is 37%, 14%, and 2.2% of the total available bandwidth for HDD, SSD and NVM, respectively, as shown in Figure 12b. Adaptive preemption decreases this bandwidth usage on HDD and SSD to 15.7% and 8.3%, respectively. This reduction comes from the adaptive policy both checkpointing less frequently (opting to kill recently started tasks instead) and checkpointing less data by leveraging incremental checkpointing. As with CPU overhead, the bandwidth usage of adaptive preemption on HDD and SSD is acceptably low, and the overhead is negligible for NVM.

The average storage used for checkpoints during preemption, as a percentage of total storage capacity, is 5.1% on HDD and 7.6% on SSD. The maximum storage required for checkpoints during execution is the total memory capacity of the cluster, in case the entire cluster's memory state must be dumped and stored. For example, our workload contains a production job that is larger than the capacity of the cluster; when this job is submitted and scheduled, it preempts all non-production jobs running in the cluster and causes them to be checkpointed. The storage requirement for our workload is about 10% of the total storage capacity.

In summary, the overhead introduced by checkpoint-based preemption is moderate to low. Moreover, the adaptive policy not only improves overall job performance but also greatly reduces the CPU and I/O overhead associated with checkpointing.

6. RELATED WORK

Some previous work has studied the negative effects of preemptive scheduling in shared clusters [6, 16, 5]. Cavdar et al. [4] analyzed task eviction events in the Google cluster and found that most evictions were caused by priority scheduling. They developed task eviction policies that mitigate wasted resources and response time degradation by imposing a threshold on the number of evictions per task; however, their work is based on simulation and does not consider checkpointing overhead. Harchol-Balter et al. [13] showed that preemptively migrating long-running processes reduces the mean delay of incoming jobs.

Recently, application-specific checkpointing has been used to improve resource management. For example, Hadoop checkpoint-based scheduling saves the progress of certain Map tasks in a MapReduce job during preemption [1, 6, 19]; however, these systems are limited to checkpointing only MapReduce applications, and they often require modifying application programs. In contrast, our method is an application-transparent, system-level mechanism that can suspend and resume any application without modifying its code.

Traditional HPC or VM-based suspend/resume solutions are coarse-grained and too expensive for emerging workloads, such as big-data applications, which require fine-grained resource sharing and data locality. The most closely related work to ours is SLURM, which can checkpoint using BLCR [2]; however, BLCR is not portable across platforms and is limited in the types of applications it can checkpoint. Yank [23] and SpotCheck [22] offer high availability to transient servers by storing VM state on backup servers, but doing so can be expensive if revocations occur frequently.



[Figure 12 plots: (a) CPU overhead and (b) I/O overhead, in percent (0-100), of Basic vs. Adaptive checkpointing on HDD, SSD, and NVM.]

Figure 12: Overhead of Basic Checkpoint-based Preemption and Adaptive Preemption.


Analyses of the Google cluster trace have been conducted in [10, 18, 20]. The focus of these works was statistical analysis of the workload's properties, while our focus is on characterizing and evaluating the resource efficiency and performance impact of preemption in cluster scheduling.

System-level checkpointing mechanisms such as BLCR, Linux-CR and CRIU use file systems on disk to save checkpoints. Prior work on NVM checkpointing [11, 16] has focused on optimization techniques and architectural enhancements for improving reliability and availability. Most of these mechanisms have been used for fault tolerance, and none has been applied in the context of performance improvement and resource efficiency in cluster resource management.

7. CONCLUSION AND FUTURE WORK

Resource management systems in shared clusters typically employ preemption to recover from saturation and to support QoS among multiple tenants. The current preemption mechanism simply kills preempted jobs, which can cause significant waste and delay the response time of some jobs.

In this paper, we present an alternative, non-killing preemption approach that utilizes system-level, application-transparent checkpointing to preserve the progress of preempted jobs in order to improve resource efficiency and application performance in cluster scheduling. We implement a prototype, including an implementation on the Hadoop YARN platform, and conduct an extensive experimental study via trace-driven simulation and real applications. We demonstrate that: (1) preemption using application-transparent checkpointing is feasible and reduces resource and power wastage while improving overall application performance in shared clusters, even on slow storage such as HDD; (2) adaptive preemption that combines checkpointing and killing further improves performance and reduces cost; (3) checkpoint-based preemption with slow storage may hurt the performance of certain jobs; and (4) by leveraging emerging fast storage technologies such as NVM, checkpoint-based preemption improves application performance in all job categories while achieving significant savings in resource usage.

In the future, we plan to apply the proposed approach to a wider range of applications, including MapReduce, and to investigate how to implement more efficient checkpointing and preemption using NVM as virtual memory. With continued advances in storage technologies and OS-level checkpointing support [8, 16], we anticipate even greater savings as suspend-resume becomes faster and cheaper.

8. ACKNOWLEDGMENTS

This work was done mainly during Jack Li's internship at HP Labs. Jack Li and Calton Pu are partially supported by NSF Foundation CNS/SAVI (1250260, 1402266), IUCRC/FRP (1127904), and CISE/CNS (1138666, 1421561) programs, and by gifts, grants, or contracts from HP, the Singapore Government, and the Georgia Tech Foundation through the John P. Imlay, Jr. Chair endowment.

9. REFERENCES

[1] G. Ananthanarayanan, C. Douglas, R. Ramakrishnan, S. Rao, and I. Stoica. True elasticity in multi-tenant data-intensive compute clusters. SoCC '12, pages 24:1-24:7, New York, NY, USA, 2012. ACM.

[2] D. Auble and J. Morris. Simple Linux utility for resource management, http://bit.ly/1FpdnQ1. 2013.

[3] E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou. Apollo: Scalable and coordinated scheduling for cloud-scale computing. OSDI '14, pages 285-300, Berkeley, CA, USA, 2014. USENIX Association.

[4] D. Cavdar, A. Rosa, L. Y. Chen, W. Binder, and F. Alagoz. Quantifying the brown side of priority schedulers: Lessons from big clusters. SIGMETRICS Perform. Eval. Rev., 42(3):76-81, Dec. 2014.

[5] L. Cheng, Q. Zhang, and R. Boutaba. Mitigating the negative impact of preemption on heterogeneous MapReduce workloads. CNSM '11, pages 189-197, Laxenburg, Austria, 2011. International Federation for Information Processing.

[6] B. Cho, M. Rahman, T. Chajed, I. Gupta, C. Abad, N. Roberts, and P. Lin. Natjam: Design and evaluation of eviction policies for supporting priorities and deadlines in MapReduce clusters. SoCC '13, pages 6:1-6:17, New York, NY, USA, 2013. ACM.

[7] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee. Better I/O through byte-addressable, persistent memory. SOSP '09, New York, NY, USA, 2009. ACM.

[8] CRIU. Checkpoint/restore in userspace, http://criu.org. 2014.

[9] R. R. Curtin, J. R. Cline, N. P. Slagle, W. B. March, P. Ram, N. A. Mehta, and A. G. Gray. MLPACK: A scalable C++ machine learning library. Journal of Machine Learning Research, 14:801-805, 2013.

[10] S. Di, D. Kondo, and C. Franck. Characterizing cloud applications on a Google data center. ICPP '13, Lyon, France, Oct. 2013.

[11] X. Dong, Y. Xie, N. Muralimanohar, and N. P. Jouppi. Hybrid checkpointing using emerging nonvolatile memories for future exascale systems. ACM Trans. Archit. Code Optim., 8(2):6:1-6:29, June 2011.

[12] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson. System software for persistent memory. EuroSys '14, pages 15:1-15:15, New York, NY, USA, 2014. ACM.

[13] M. Harchol-Balter and A. B. Downey. Exploiting process lifetime distributions for dynamic load balancing. ACM Trans. Comput. Syst., 15(3):253-285, Aug. 1997.

[14] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. NSDI '11, pages 295-308, Berkeley, CA, USA, 2011. USENIX Association.

[15] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. EuroSys '07, pages 59-72, New York, NY, USA, 2007. ACM.

[16] S. Kannan, A. Gavrilovska, K. Schwan, and D. Milojicic. Optimizing checkpoints using NVM as virtual memory. IPDPS '13, pages 29-40, May 2013.

[17] M. H. Lankhorst, B. W. Ketelaars, and R. Wolters. Low-cost and nanoscale non-volatile memory concept for future silicon chips. Nature Materials, 4(4):347-352, 2005.

[18] A. K. Mishra, J. L. Hellerstein, W. Cirne, and C. R. Das. Towards characterizing cloud backend workloads: Insights from Google compute clusters. SIGMETRICS Perform. Eval. Rev., 37(4):34-41, Mar. 2010.

[19] J.-A. Quiane-Ruiz, C. Pinkel, J. Schad, and J. Dittrich. RAFTing MapReduce: Fast recovery on the RAFT. ICDE '11, pages 589-600, April 2011.

[20] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. SoCC '12, New York, NY, USA, 2012. ACM.

[21] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. EuroSys '13, pages 351-364, Prague, Czech Republic, 2013.

[22] P. Sharma, S. Lee, T. Guo, D. Irwin, and P. Shenoy. SpotCheck: Designing a derivative IaaS cloud on the spot market. EuroSys '15, pages 16:1-16:15, New York, NY, USA, 2015. ACM.

[23] R. Singh, D. Irwin, P. Shenoy, and K. Ramakrishnan. Yank: Enabling green data centers to pull the plug. NSDI '13, pages 143-155, Lombard, IL, 2013. USENIX.

[24] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet Another Resource Negotiator. SoCC '13, pages 5:1-5:16, New York, NY, USA, 2013. ACM.

[25] J. Wilkes. More Google cluster data. Google research blog, http://bit.ly/1A38mfR. Nov. 2011.

[26] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. HotCloud '10, Berkeley, CA, USA, 2010. USENIX Association.


