Rescheduling Motivation
• Heterogeneity and contention can cause an application’s performance to vary over time
• Rescheduling decisions in response to changes in resource performance are necessary
  - Performance degradation of the running applications
  - Availability of “better” resources
Modeling the Cost of Redistribution
• Cthreshold depends on:
  - Model accuracy
  - Load dynamics of the system
Redistribution Cost Model for Jacobi 2D
• Emax – average iteration time of the processor that is farthest behind
• Cdev – processor performance deviation variable
Experiments
• 8 processors were used
• A loading event consisting of a parallel program was introduced 3 minutes after Jacobi started
• Number of tasks of the loading event varied
• Cthreshold – 15 seconds
Malleable Jobs
• Parallel Jobs
  - Rigid – only one set of processors
  - Moldable – flexible during job start, but cannot be reconfigured during execution
  - Malleable – flexible during job start as well as during execution
Rescheduling in GrADS
• Performance-oriented migration framework
• Tightly coupled policies for suspension and migration
• Takes into account load characteristics, remaining execution times
• Migration of application depends on:
  - The amount of increase or decrease in loads on the system
  - The time of the application execution when load is introduced into the system
  - The performance benefits that can be obtained due to migration
Components:
1. Migrator
2. Contract Monitor
3. Rescheduler
SRS Checkpointing Library
• End application instrumented with user-level checkpointing library
• Enables reconfiguration of executing applications across distinct domains
• Allows fault tolerance
• Uses IBP (Internet Backplane Protocol) for storage and retrieval of checkpoints
• Needs Runtime Support System (RSS) – an auxiliary daemon that is started with the parallel application
• Simple API
  - SRS_Init()
  - SRS_Restart_Value()
  - SRS_Register()
  - SRS_Check_Stop()
  - SRS_Read()
  - SRS_Finish()
  - SRS_StoreMap(), SRS_DistributeFunc_Create(), SRS_DistributeMap_Create()
SRS INTERNALS
[Figure: MPI Application linked with SRS; IBP storage depots; Runtime Support System (RSS). Labeled interactions: Start, Poll STOP, STOP, ReStart, Read with possible redistribution.]
SRS API
Original code

  /* begin code */
  MPI_Init()
  /* initialize data */
  loop{
  }
  MPI_Finalize()

SRS Instrumented code

  /* begin code */
  MPI_Init()
  SRS_Init()
  restart_value = SRS_Restart_Value()
  if(restart_value == 0){
    /* initialize data */
  }
  else{
    SRS_Read("data", data, BLOCK, NULL)
  }
  SRS_Register("data", data, SRS_INT, data_size, BLOCK, NULL)
  loop{
    stop_value = SRS_Check_Stop()
    if(stop_value == 1){
      exit();
    }
  }
  SRS_Finish()
  MPI_Finalize()
SRS Example – Original Code

  MPI_Init(&argc, &argv);
  local_size = global_size/size;
  if(rank == 0){
    for(i=0; i<global_size; i++){
      global_A[i] = i;
    }
  }
  MPI_Scatter (global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
  iter_start = 0;
  for(i=iter_start; i<global_size; i++){
    proc_number = i/local_size;
    local_index = i%local_size;
    if(rank == proc_number){
      local_A[local_index] += 10;
    }
  }
  MPI_Finalize();
SRS Example – Modified Code

  MPI_Init(&argc, &argv);
  SRS_Init();
  local_size = global_size/size;
  restart_value = SRS_Restart_Value();
  if(restart_value == 0){
    if(rank == 0){
      for(i=0; i<global_size; i++){
        global_A[i] = i;
      }
    }
    MPI_Scatter (global_A, local_size, MPI_INT, local_A, local_size, MPI_INT, 0, comm);
    iter_start = 0;
  }
  else{
    SRS_Read("A", local_A, BLOCK, NULL);
    SRS_Read("iterator", &iter_start, SAME, NULL);
  }
  SRS_Register("A", local_A, GRADS_INT, local_size, BLOCK, NULL);
  SRS_Register("iterator", &i, GRADS_INT, 1, 0, NULL);
SRS Example – Modified Code (Contd..)

  for(i=iter_start; i<global_size; i++){
    stop_value = SRS_Check_Stop();
    if(stop_value == 1){
      MPI_Finalize();
      exit(0);
    }
    proc_number = i/local_size;
    local_index = i%local_size;
    if(rank == proc_number){
      local_A[local_index] += 10;
    }
  }
  SRS_Finish();
  MPI_Finalize();
Components (Continued..)
Contract Monitor:
» Monitors the progress of the end application
» Tolerance limits specified to the contract monitor
  » Upper contract limit – 2.0
  » Lower contract limit – 0.7
» When it receives the actual execution time for an iteration from the application
  » Calculates the ratio between actual and predicted times
  » Adds it to the average ratio
  » Adds it to the last_5_avg
Contract Monitor
• If average ratio > upper contract limit
  - Contact rescheduler
  - Request for rescheduling
  - Receive reply
  - If reply is “SORRY. CANNOT RESCHEDULE”
    - Calculate new_predicted_time based on last_5_avg and orig_predicted_time
    - Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit
    - Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit
    - prev_predicted_time = new_predicted_time
Contract Monitor
• If average ratio < lower contract limit
  - Calculate new_predicted_time based on last_5_avg and orig_predicted_time
  - Adjust upper_contract_limit based on new_predicted_time, prev_predicted_time, prev_upper_contract_limit
  - Adjust lower_contract_limit based on new_predicted_time, prev_predicted_time, prev_lower_contract_limit
  - prev_predicted_time = new_predicted_time
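The monitoring rules on the Contract Monitor slides can be sketched in C as follows. This is a minimal illustration, not the GrADS implementation: the type `contract_monitor`, the functions `cm_init`, `cm_average`, `cm_last_5_avg` and `cm_record`, and the enum values are all hypothetical names. The slides do not give the formulas used to adjust the contract limits, so the sketch only reports which branch applies and leaves the adjustment to the caller.

```c
#include <stddef.h>

#define CM_WINDOW 5  /* size of the last_5_avg window */

/* Outcome of one monitoring step (illustrative names). */
typedef enum { CM_OK, CM_REQUEST_RESCHEDULE, CM_ADJUST_LIMITS } cm_action;

typedef struct {
    double upper_limit;        /* 2.0 on the slides */
    double lower_limit;        /* 0.7 on the slides */
    double ratio_sum;          /* sum of all ratios seen so far */
    size_t count;              /* number of iterations reported */
    double window[CM_WINDOW];  /* most recent ratios, circular buffer */
} contract_monitor;

void cm_init(contract_monitor *cm, double upper, double lower)
{
    cm->upper_limit = upper;
    cm->lower_limit = lower;
    cm->ratio_sum = 0.0;
    cm->count = 0;
    for (size_t i = 0; i < CM_WINDOW; i++)
        cm->window[i] = 0.0;
}

/* Average ratio over every iteration reported so far. */
double cm_average(const contract_monitor *cm)
{
    return cm->count ? cm->ratio_sum / (double)cm->count : 0.0;
}

/* Average ratio over the last (up to) five iterations. */
double cm_last_5_avg(const contract_monitor *cm)
{
    size_t n = cm->count < CM_WINDOW ? cm->count : CM_WINDOW;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += cm->window[i];
    return n ? sum / (double)n : 0.0;
}

/* Record one iteration time and report which branch of the slides applies.
 * predicted must be non-zero. */
cm_action cm_record(contract_monitor *cm, double actual, double predicted)
{
    double ratio = actual / predicted;
    cm->ratio_sum += ratio;
    cm->window[cm->count % CM_WINDOW] = ratio;
    cm->count++;

    double avg = cm_average(cm);
    if (avg > cm->upper_limit)
        return CM_REQUEST_RESCHEDULE;  /* contact the rescheduler */
    if (avg < cm->lower_limit)
        return CM_ADJUST_LIMITS;       /* running faster than predicted */
    return CM_OK;
}
```

A caller would initialize the monitor with `cm_init(&cm, 2.0, 0.7)` and feed each iteration time to `cm_record`. On `CM_REQUEST_RESCHEDULE` it would contact the rescheduler and, if the reply is a refusal, fall back to the limit adjustment based on `cm_last_5_avg`.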
Rescheduler
• A metascheduling service
• Operates in 2 modes
  - When the contract monitor requests rescheduling – i.e. during performance degradation
  - Periodically queries the Database manager for recently completed GrADS applications and migrates executing applications to make use of freed resources – i.e. opportunistic rescheduling
Application and Metascheduler Interactions
[Flowchart: User (problem parameters, initial list of machines); Resource Selection; Requesting Permission (Permission Service; Permission? NO: Abort, Exit; YES: continue); Application Specific Scheduling (application specific schedule); Contract Development (Contract Negotiator; Contract Approved? NO: get new resource information; YES: continue); Application Launching (problem parameters, final schedule); Application Completion? (application completed: Exit; application was stopped: wait for restart signal, then get new resource information).]
Rescheduler Architecture
[Figure: Application Manager loop (Application Launching; Application Completion?; Exit when the application completed; wait for restart signal when the application was stopped; get new resource information) together with the Application, Contract Monitor, Runtime Support System (RSS), Database Manager and Rescheduler. Labeled messages: Execution time, Query for STOP signal, Request for migration, Send STOP signal, Store STOP, Store RESUME.]
Static Rescheduling Cost

  Rescheduling Phase                               Time (seconds)
  Writing checkpoints                              40
  Waiting for NWS update                           90
  NWS retrieval time                               120
  Application-level scheduling                     80
  Other Grid overhead                              10
  Starting application                             60
  Reading checkpoints and data redistribution      500
  Total                                            900
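The phase times in the static cost table add up to the 900-second total, and that fixed overhead is what any migration has to win back. The C sketch below sums the phases and compares the remaining execution time on the current resources with the remaining time on the new resources plus the rescheduling cost. The comparison rule and the function names `static_rescheduling_cost` and `migration_benefit` are assumptions made for illustration. The slides only state that the decision takes remaining execution times and the obtainable performance benefit into account.

```c
#include <stddef.h>

/* Phase costs in seconds, copied from the static rescheduling cost table. */
static const double phase_cost[] = {
    40.0,   /* writing checkpoints */
    90.0,   /* waiting for NWS update */
    120.0,  /* NWS retrieval time */
    80.0,   /* application-level scheduling */
    10.0,   /* other Grid overhead */
    60.0,   /* starting application */
    500.0   /* reading checkpoints and data redistribution */
};

/* Sum of all rescheduling phases, in seconds. */
double static_rescheduling_cost(void)
{
    double total = 0.0;
    size_t n = sizeof phase_cost / sizeof phase_cost[0];
    for (size_t i = 0; i < n; i++)
        total += phase_cost[i];
    return total;
}

/* Seconds saved by migrating; a negative value means migration is a loss.
 * Assumed rule: benefit = remaining time here - (remaining time there + cost). */
double migration_benefit(double remaining_current, double remaining_new)
{
    return remaining_current - (remaining_new + static_rescheduling_cost());
}
```

Under this assumed rule, an application with 1000 seconds left would not be migrated even to resources twice as fast, because the 900-second overhead outweighs the 500 seconds saved.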
Experiments and Results
Rescheduling on request
• Different problem sizes of ScaLAPACK QR
• msc – fast machines; opus – slow machines
• Initial set of resources consisted of 4 msc and 8 opus machines
• The performance model always chose 4 msc machines for the application run
• 5 minutes into the application run, artificial load is introduced on the 4 msc machines
• The application migrated from UT to UIUC
[Figure: results with “No rescheduling” and with “Rescheduling”. Annotation: Rescheduler decided not to reschedule for size 8000. Wrong decision!]
Rescheduling Depending on Amount of Load
• ScaLAPACK QR problem size – 12000
• Load introduced 20 minutes after application start
• The amount of load was varied
[Figure: results with “No rescheduling” and with “Rescheduling”. Annotation: Rescheduler decided not to reschedule. Wrong decision!]
Rescheduling Depending on Load Introduction Time
• ScaLAPACK QR problem size – 12000
• Same load introduced at different points of application execution
[Figure: results with “No rescheduling” and with “Rescheduling”. Annotation: Rescheduler decided not to reschedule. Wrong decision!]
Experiments and Results
Opportunistic Rescheduling
• Two problems –
  - 1st problem, size 14000, executing on 6 msc machines.
  - 2nd problem of varying sizes.
• 2nd problem introduced 2 minutes after the start of the 1st problem.
• Initial set of resources for the 2nd problem consisted of 6 msc machines and 2 opus machines.
• Due to the presence of the 1st problem, the 2nd problem had to use both the msc and opus machines, hence involved Internet bandwidth.
• After the 1st problem completes, the 2nd problem can be rescheduled to use only the msc machines.
[Figure: bars labeled “Large problem”, “No rescheduling” and “Rescheduling”.]
Dynamic Prediction of Rescheduling Cost
• The rescheduler, during a rescheduling decision, contacts the RSS and obtains the current distributions of the application data
• Forms old and new data maps
• Based on the maps and current NWS information, predicts redistribution cost
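A rough C sketch of this prediction step follows, under stated assumptions. The data is taken to be one-dimensional and block-distributed. The old and new data maps are derived from the processor counts alone. NWS is assumed to supply one bandwidth figure in bytes per second for each (old processor, new processor) pair. All transfers are assumed to run one after another, which gives a pessimistic bound. The function names are hypothetical and this is not the rescheduler's actual model.

```c
#include <stddef.h>

/* First element owned by processor k when n elements are block-distributed
 * over p processors (illustrative 1-D block map). */
static long block_start(long n, int p, int k)
{
    return (n * k) / p;
}

/* Number of elements that old processor i must send to new processor j:
 * the overlap between the two block ranges. */
long overlap_elements(long n, int p_old, int i, int p_new, int j)
{
    long lo_old = block_start(n, p_old, i);
    long hi_old = block_start(n, p_old, i + 1);
    long lo_new = block_start(n, p_new, j);
    long hi_new = block_start(n, p_new, j + 1);
    long lo = lo_old > lo_new ? lo_old : lo_new;
    long hi = hi_old < hi_new ? hi_old : hi_new;
    return hi > lo ? hi - lo : 0;
}

/* Predicted redistribution time in seconds, assuming serialized transfers.
 * bandwidth[i * p_new + j] is the measured bytes/second from old
 * processor i to new processor j (assumed to come from NWS). */
double predict_redistribution_cost(long n, size_t elem_size,
                                   int p_old, int p_new,
                                   const double *bandwidth)
{
    double seconds = 0.0;
    for (int i = 0; i < p_old; i++) {
        for (int j = 0; j < p_new; j++) {
            long count = overlap_elements(n, p_old, i, p_new, j);
            if (count > 0)
                seconds += (double)count * (double)elem_size
                           / bandwidth[i * p_new + j];
        }
    }
    return seconds;
}
```

A more faithful model would also account for latency and for transfers that proceed in parallel; the sketch only shows how the two maps determine which processor pairs exchange data and how much.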
Dynamic Prediction of Rescheduling Cost
[Figure: predicted rescheduling cost. Application started on: 4 mscs; application restarted on: 8 opus.]
References / Sources / credits
• Gary Shao, Rich Wolski and Fran Berman. “Predicting the Cost of Redistribution in Scheduling”. Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing.
• Vadhiyar, S. and Dongarra, J. “Performance Oriented Migration Framework for the Grid”. Proceedings of The 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2003), pp 130-137, May 2003, Tokyo, Japan.
• L. V. Kale, Sameer Kumar, and J. DeSouza. “A Malleable-Job System for Timeshared Parallel Machines”. 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2002), May 21-24, 2002, Berlin, Germany.
• See Cactus migration thorn
• See opportunistic migration by Huedo
GridWay
• Migration:
  - When performance degradation happens
  - When “better” resources are discovered
  - When requirements change
  - Owner decision
  - Remote resource failure
• Rescheduling done at discovery interval
• Performance degradation evaluator program executed at monitoring interval
Components
• Request manager
• Dispatch manager
• Submission manager – prologing, submitting, canceling, epiloging
• Performance monitor
• Application specific components
  - Resource selector
  - Performance degradation evaluator
  - Prolog
  - Wrapper
  - Epilog
Opportunistic Job Migration
• Factors
  - Performance of new host
  - Remaining execution time of application
  - Proximity of new resource to the needed data
Dynamic Space sharing on clusters of non-dedicated workstations (Chowdhury et al.)
• Dynamic reconfiguration – application level approach for dynamic reconfiguration of grid-based iterative applications
SRS Data Redistribution Cost
[Figure: data redistribution cost. Started on – 8 MSCs; restarted on – 8 OPUS, 2 MSCs.]
Modified GrADS Architecture
[Figure: components shown: User, Grid Routine / Application Manager, Resource Selector, Performance Modeler, Contract Developer, App Launcher, Contract Monitor, Application, MDS, NWS, Permission Service, RSS, Contract Negotiator, Rescheduler, Database Manager.]
Another approach: AMPI
• AMPI – MPI implementation on top of Charm++
• Processes implemented as user-level threads
• Charm++ provides load balancing framework, migrates threads
• The load balancing framework accepts a processor map
• Parallel job started on all processors in the system
• Allocates work only to processors in the processor map, i.e. threads/objects are assigned to processors in the processor map
Rescheduling
• When processor map changes
  - Threads are migrated to the new set of processors in the processor map
  - Skeleton processes are left behind in the vacated processors
  - A skeleton forwards messages to threads/objects previously housed in the processor
• New processor map conveyed to the load balancer framework by the adaptive job scheduler
Overhead
• Shrink or expand time depends on:
  - Per-process data that has to be transferred
  - Number of processors involved
Adaptive Job Scheduler
• Variant of dynamic equipartitioning strategy
• Each job specifies min. and max. number of procs. that it can run on.
• The scheduler recalculates the number of procs. assigned to each running job
• Running jobs and the new job are first assigned their minimum requirement
• The leftover procs. are equally divided among all the jobs
• The new job is assigned to a queue if it cannot be allocated its minimum requirement
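The allocation rule described on this slide can be sketched in C as below. It is a minimal illustration under assumptions, not the scheduler from the cited work: the `job` struct and the `equipartition` function are hypothetical, and leftover processors are handed out one at a time in round-robin order as one simple way to divide them equally while respecting each job's maximum.

```c
#include <stddef.h>

typedef struct {
    int min_pe;    /* minimum number of processors the job can run on */
    int max_pe;    /* maximum number of processors the job can run on */
    int assigned;  /* output: processors granted by the scheduler */
} job;

/* Recompute the allocation for all jobs on a machine with total_pe
 * processors. Returns 1 on success. Returns 0 if the minimum requirements
 * do not fit, in which case the caller would queue the newest job and the
 * assigned fields should be ignored. */
int equipartition(job *jobs, size_t njobs, int total_pe)
{
    int left = total_pe;

    /* Step 1: every job first receives its minimum requirement. */
    for (size_t i = 0; i < njobs; i++) {
        jobs[i].assigned = jobs[i].min_pe;
        left -= jobs[i].min_pe;
    }
    if (left < 0)
        return 0;

    /* Step 2: divide the leftover processors equally, one per job per
     * round, skipping jobs that have reached their maximum. */
    int progress = 1;
    while (left > 0 && progress) {
        progress = 0;
        for (size_t i = 0; i < njobs && left > 0; i++) {
            if (jobs[i].assigned < jobs[i].max_pe) {
                jobs[i].assigned++;
                left--;
                progress = 1;
            }
        }
    }
    return 1;
}
```

For example, on 32 processors three jobs with (min, max) of (4, 64), (2, 8) and (1, 3) end up with 21, 8 and 3 processors: the two smaller jobs saturate at their maxima and the first job absorbs the rest.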
Scheduling
• Same strategy followed when jobs complete
• The scheduler conveys the decision by bit-vector to jobs
• Jobs do thread migration
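One way to picture the bit-vector hand-off is sketched below in C. The encoding is an assumption made for illustration: one bit per processor on a machine with at most 32 processors, with bit k set when processor k belongs to the job. The type `proc_map` and the helper functions are hypothetical names, not part of Charm++ or AMPI.

```c
#include <stdint.h>

/* Bit k set means processor k belongs to the job (up to 32 processors). */
typedef uint32_t proc_map;

/* Build a map granting processors first, first+1, ..., first+count-1.
 * Assumes first >= 0; processors beyond index 31 are ignored. */
proc_map make_map(int first, int count)
{
    proc_map m = 0;
    for (int k = first; k < first + count && k < 32; k++)
        m |= (proc_map)1 << k;
    return m;
}

/* Processors the job must vacate: set in the old map, clear in the new. */
proc_map vacated(proc_map old_map, proc_map new_map)
{
    return old_map & ~new_map;
}

/* Processors the job may expand onto: set in the new map, clear in the old. */
proc_map gained(proc_map old_map, proc_map new_map)
{
    return new_map & ~old_map;
}

/* Number of processors in a map (counts the set bits). */
int map_size(proc_map m)
{
    int n = 0;
    for (; m != 0; m &= m - 1)
        n++;
    return n;
}
```

On receiving a new map, a job would migrate its threads off every processor in `vacated(old, new)` and could spread them onto those in `gained(old, new)`.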
Experiments
• 32 processor Linux cluster
• Job arrival by Poisson process
• Each job – a molecular dynamics (MD) program with 50,000 atoms with different number of iterations
• Number of iterations exponentially distributed
• Minimum number of procs., minpe – uniformly distributed between 1 and 64
• maxpe – 64
• Each experiment – 50 job arrivals
Dynamic reconfiguration
• Ability to change number of processors during execution
• Condor like environment
  - Respect ownerships of workstations
  - Provide high performance for parallel applications
• Dynamic reconfiguration also provides high throughput for the system