JSSPP-11, Boston, MA June 19, 2005
Pitfalls in Parallel Job Scheduling Evaluation
Designing Parallel Operating Systems using Modern Interconnects
Eitan Frachtenberg and Dror Feitelson
Computer and Computational Sciences Division, Los Alamos National Laboratory
Scope
Numerous methodological issues occur with the evaluation of parallel job schedulers:
Experiment theory and design
Workloads and applications
Implementation issues and assumptions
Metrics and statistics
Paper covers 32 recurring pitfalls, organized into topics and sorted by severity
Talk will describe a real case study, and the heroic attempts to avoid most such pitfalls
…as well as the less-heroic oversight of several others
Evaluation Paths
Theoretical analysis (queuing theory): reproducible, rigorous, and resource-friendly; hard for time slicing due to unknown parameters, application structure, and feedbacks
Simulation: relatively simple and flexible; many assumptions, not all known/reported; hard to reproduce; rarely factors application characteristics
Experiments with real sites and workloads: most representative (at least locally); largely impractical and irreproducible
Emulation
Emulation Environment
Experimental platform consisting of three clusters with high-end network
Software: several job scheduling algorithms implemented on top of STORM:
Batch / space sharing, with optional EASY backfilling
Gang Scheduling, Implicit Coscheduling (SB), Flexible Coscheduling
Results described in [JSSPP'03] and [TPDS'05]
Step One: Choosing Workload
Static vs. dynamic
Size of workload
How many different workloads are needed?
Use trace data?
Different sites have different workload characteristics
Inconvenient sizes may require imprecise scaling
"Polluted" data, flurries
Use model-generated data?
Several models exist, with different strengths
By trying to capture everything, may capture nothing
Static Workloads
We start with a synthetic application and static workloads
Simple enough to model, debug, and calibrate
Bulk-synchronous application
Can control: granularity, variability, and communication pattern
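A bulk-synchronous synthetic application can be modeled as a loop of compute phases separated by barriers, where each iteration lasts as long as its slowest process. A minimal sketch in Python (the real experiments used an MPI program; the function names, the process count, and the uniform perturbation are illustrative assumptions):

```python
import random

def step_time(nprocs, granularity, variability, rng):
    # Each process computes for ~granularity seconds, perturbed by
    # +/- variability; the barrier makes the whole step take as long
    # as the slowest process.
    return max(granularity * (1 + rng.uniform(-variability, variability))
               for _ in range(nprocs))

def bulk_synchronous_runtime(iterations, nprocs, granularity, variability, seed=0):
    # Total run time of one job: the sum of per-iteration step times.
    rng = random.Random(seed)
    return sum(step_time(nprocs, granularity, variability, rng)
               for _ in range(iterations))
```

With variability 0 the run time is exactly iterations × granularity; raising variability lengthens it, because the barrier always waits for the slowest process.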
Synthetic Scenarios
Balanced Complementing Imbalanced Mixed
Example: Turnaround Time
[Bar chart: turnaround time (0–400) for the Balanced, Imbalanced, Complementing, and Mixed scenarios under FCFS, GS, SB, FCS, and Optimal]
Dynamic Workloads
We chose Lublin's model [JPDC'03]
1000 jobs per workload
Multiplying run times AND arrival times by a constant to "shrink" run time (2–4 hours)
Shrinking too much is problematic (system constants)
Multiplying arrival times by a range of factors to modify load
Unrepresentative, since it deviates from "real" correlations with run times and job sizes
A better solution is to use different workloads
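The two transformations above are easy to confuse, so a sketch may help (the job-record fields `arrival`, `runtime`, and `size` are illustrative assumptions, not the model's actual output format):

```python
def shrink_workload(jobs, factor):
    # Scale run times AND arrival times by the same factor: the
    # experiment gets shorter but the offered load is preserved.
    return [dict(job, runtime=job["runtime"] * factor,
                      arrival=job["arrival"] * factor)
            for job in jobs]

def scale_load(jobs, factor):
    # Scale only the arrival times to change the offered load.
    # This is what the slide warns about: it drifts away from the
    # model's correlations between load, run times, and job sizes.
    return [dict(job, arrival=job["arrival"] * factor) for job in jobs]
```

`shrink_workload` keeps the runtime/interarrival ratio fixed; `scale_load` deliberately changes it, which is why varying load this way can be unrepresentative.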
Step Two: Choosing Applications
Synthetic applications are easy to control, but:
Some characteristics are ignored (e.g., I/O, memory)
Others may not be representative, in particular communication, which is salient in parallel apps:
Granularity, pattern, network performance
If not sure, conduct sensitivity analysis
Might be assumed malleable, moldable, or with linear speedup, which many MPI applications are not
Real applications have no hidden assumptions
But may also have limited generality
Example: Sensitivity Analysis
Application Choices
Synthetic applications on first set
Allows control over more parameters
Allows testing unrealistic but interesting conditions (e.g., high multiprogramming level)
LANL applications on second set (Sweep3D, Sage)
Real memory and communication use (MPL=2)
Important applications for LANL's evaluations
But probably only for LANL…
Runtime estimate: f-model on batch, MPL on others
Step Three: Choosing Parameters
What are reasonable input parameters to use in the evaluation?
Maximum multiprogramming level (MPL)
Timeslice quantum
Input load
Backfilling method and effect on multiprogramming
Run time estimate factor (not tested)
Algorithm constants, tuning, etc.
Example 1: MPL
Verified with different offered loads
Example 2: Timeslice
Dividing into quantiles allows analyzing the effect on different job types
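Splitting the jobs into run-time quantiles and reporting a metric per group is straightforward; a sketch (the function name and the `runtime` field are illustrative assumptions):

```python
def runtime_quantiles(jobs, n=4):
    # Sort jobs by run time and cut them into n equal-size groups
    # (any remainder lands in the last group), so a metric such as
    # mean wait time can be reported separately for short, medium,
    # and long jobs instead of one global average.
    ordered = sorted(jobs, key=lambda job: job["runtime"])
    size = len(ordered) // n
    groups = [ordered[i * size:(i + 1) * size] for i in range(n - 1)]
    groups.append(ordered[(n - 1) * size:])
    return groups
```

A timeslice that helps the shortest quantile may hurt the longest one, which a single aggregate metric would hide.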
Considerations for Parameters
Realistic MPLs
Scaling traces to different machine sizes
Scaling offered load
Artificial user estimates and multiprogramming estimates
Step Four: Choosing Metrics
Not all metrics are easily comparable:
Absolute times, slowdown with time slicing, etc.
Metrics may need to be limited to a relevant context
Use multiple metrics to understand characteristics
Measuring utilization for an open model:
Direct measure of offered load till saturation
Same goes for throughput and makespan
Better metrics: slowdown, response time, wait time
Using mean with asymmetric distributions
Inferring scalability from O(1) nodes
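Bounded slowdown, used in the examples that follow, is commonly defined in the JSSPP literature as ordinary slowdown with the run time clamped from below by a threshold. A sketch (the function name and the 10-second threshold are illustrative; evaluations must report the threshold they use):

```python
def bounded_slowdown(wait_time, run_time, tau=10.0):
    # Slowdown is (wait + run) / run; clamping the denominator by a
    # threshold tau keeps very short jobs from blowing the metric up,
    # and the outer max keeps the result >= 1.
    return max(1.0, (wait_time + run_time) / max(run_time, tau))
```

Note that a one-second job that waits 99 seconds gets a bounded slowdown of 10, not 100, which is exactly the point: without the bound, a handful of tiny jobs can dominate the mean.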
Example: Bounded Slowdown
Example (continued)
Response Time
Bounded Slowdown
Step Five: Measurement
Never measure saturated workloads
When arrival rate is higher than service rate, queues grow to infinity; all metrics become meaningless
…but finding the saturation point can be tricky
Discard warm-up and cool-down results
May need to measure subgroups separately (long/short, day/night, weekday/weekend, …)
Measurement should still have enough data points for statistical meaning, especially workload length
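Discarding warm-up and cool-down can be as simple as dropping a fixed fraction of jobs at each end of the run; a sketch (the function name, the `arrival` field, and the 10% default are illustrative assumptions, and the right fraction depends on the workload):

```python
def steady_state(records, frac=0.1):
    # Drop the first and last `frac` of completed jobs (by arrival
    # order): the leading jobs run on a nearly empty system (warm-up)
    # and the trailing ones drain an emptying queue (cool-down), so
    # both would bias the measured metrics.
    ordered = sorted(records, key=lambda job: job["arrival"])
    k = int(len(ordered) * frac)
    return ordered[k:len(ordered) - k] if k else ordered
```

This trims from both ends, which is why the workload must be long enough that the remaining middle still gives statistically meaningful results.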
Example: Saturation Point
Example: Shortest jobs CDF
Example: Longest jobs CDF
Conclusion
Parallel job scheduling evaluation is complex
…but we can avoid past mistakes
The paper can be used as a checklist when designing and executing evaluations
Additional information in the paper:
Pitfalls, examples, and scenarios
Suggestions on how to avoid pitfalls
Open research questions (for next JSSPP?)
Many references to positive examples
Be cognizant when choosing your compromises
References
Workload archive:
http://www.cs.huji.ac.il/~feit/worklad
Contains several workload traces and models
Dror's publication page:
http://www.cs.huji.ac.il/~feit/pub.html
Eitan's publication page:
http://www.cs.huji.ac.il/~etcs/pubs
Email: [email protected]