Distributed Operating Systems
Process Scheduling
Fernando Pérez, José María Peña
Víctor Robles, Francisco Rosales
Process Management
1. Concepts and taxonomies: jobs and parallel/distributed systems
2. Static scheduling: scheduling dependent tasks, scheduling parallel tasks, scheduling tasks from multiple jobs
3. Dynamic scheduling: load balancing, process migration, data migration, connection balancing
Starting Scenario: Concepts
• Jobs: Sets of tasks (processes or threads) that require (resources × time)
– Resource: Data, devices, CPU or other (finite) elements required to carry out a job.
– Time: Period when resources are assigned (shared or dedicated) to a certain job.
– Task relationship: Tasks should be performed in an order that keeps the restrictions based on needed inputs or resources.
• Scheduling: Assigning jobs and their tasks to computational resources (especially CPU). Scheduling can be monitored, reviewed and changed over time.
Starting Scenario
[Figure: jobs and their tasks, together with the required resources, are mapped onto nodes (processors)]
GOAL: To assign users’ jobs to the nodes, with the objective of improving sequential performance.
Scheduling Characteristics
• Shared-memory systems
– Any of the processors can access the resources used by a task:
• Memory space
• Internal OS resources (files, connections, etc.)
– Automatic load sharing/balancing: free processors execute any process (task) in ready status.
– Improvements derived from efficient process management:
• Better resource usage and performance
• A parallel application uses the available processors
• Distributed systems
– Tasks are assigned to a processor for their whole running time.
– The resources used by a task can only be accessed from the local processor.
– Load balancing requires process migration.
Starting Scenario: Jobs
What do we execute? Jobs are divided into tasks:
• Independent tasks
– Independent processes
– They could belong to different users
• Cooperating tasks
– They interact somehow
– They belong to the same application
– There can be some dependencies
– Or they can require parallel execution
Cooperating Tasks
Task dependencies
• Model based on a directed acyclic graph (DAG).
• Example: a workflow, where nodes are tasks and edges represent the data transferred between them.

Parallel execution
• It requires a number of tasks executing in parallel at the same moment:
– Synchronous or asynchronous interactions.
– Based on a connection topology.
– Either a master/slave or a fully distributed model.
– Particular communication ratios and message exchanges.
• Example: MPI code
Starting Scenarios: Objectives
What kind of “better performance” are we expecting? A system taxonomy:
• High-availability systems (HAS)
– Service should always be working
– Fault tolerance
• High-performance systems (HPC: High Performance Computing)
– Reaching a higher computational power
– Executing one heavy job in less time
• High-throughput systems (HTS)
– The number of executed jobs should be maximized
– Optimizing the resource usage or the number of clients (these could represent different objectives)
Scheduling
• Scheduling deals with the distribution of tasks on a distributed computing platform:
– Attending to resource requirements
– Attending to task inter-dependencies
• Final performance depends on:
– Concurrency: the maximum number of processors running in parallel.
– Parallelism degree: the smallest granularity into which a parallel job can be divided into tasks.
– Communication costs: they can differ between processors in the same node and processors in different nodes.
– Shared resources: common resources (like memory) shared among all the tasks running in the same node.
Scheduling
• Processor usage:
– Exclusive: one task to one processor.
– Shared: if the tasks perform few I/O phases, performance is limited. It is not the usual strategy.
• Task scheduling can be planned as:
– Static scheduling: the system decides in advance where and when tasks will be executed. These decisions are taken before the actual execution of any of the tasks.
– Dynamic scheduling: once tasks are already assigned, and depending on the behavior of the system, the initial decisions are reviewed and modified. Some of the tasks of the job could already be executing.
Static Scheduling
Static Scheduling
• Generally, it is performed before allowing the job to enter the system.
• The scheduler (sometimes called resource manager) selects a job from a waiting queue (depending on the policy) if there are enough resources; otherwise, the job waits.
[Figure: jobs enter a job queue; the scheduler checks “Resources?” and either dispatches the job to the system (yes) or waits (no)]
Job Descriptions
• In order to decide the order of job executions, the scheduler should have some information about the jobs:
– Number of tasks
– Priority
– Task dependencies (DAG)
– Estimation of the required resources (processors, memory, disk)
– Estimation of the execution time (per task)
– Other execution parameters
– Any applicable restriction
• These definitions are included in a job description file. The format of this file depends on the specific scheduler.
Scheduling Interdependent Tasks
• Considering the following aspects:
– Task (estimated) duration
– Size of the data transmitted after task execution (e.g., a file)
– Task precedence (which tasks must finish before another task starts)
– Restrictions based on specific resource needs
[Figure: a directed acyclic graph (DAG) description of a job, with execution times on the tasks and transmission times on the data edges]
One option is to transform all data into the same measure (time): execution time for tasks, transmission time for data.
Heterogeneous systems make this estimation more difficult: execution time depends on the processor, and communication time depends on the connection.
Scheduling Interdependent Tasks
• Scheduling becomes the assignment of tasks to processors at a given timestamp:
– There are some approximate heuristics for tasks belonging to a single job: critical-path assignment.
– Polynomial-time algorithms exist for 2-processor systems.
– It is an NP-complete problem for N > 2.
– The theoretical model is referred to as multiprocessor scheduling.
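As a minimal sketch of the critical-path idea above: the longest weighted chain through the DAG lower-bounds the makespan of any schedule, no matter how many processors are available. The task durations and edges below are illustrative, not taken from the slides.

```python
from functools import lru_cache

# Hypothetical DAG: task -> execution time, and task -> list of successors.
durations = {"A": 7, "B": 4, "C": 2, "D": 3, "E": 5}
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}

@lru_cache(maxsize=None)
def longest_path_from(task):
    # Length of the longest dependency chain starting at `task`.
    successors = edges.get(task, [])
    tail = max((longest_path_from(s) for s in successors), default=0)
    return durations[task] + tail

# The critical path of the whole job (here: A -> B -> D -> E = 19).
critical = max(longest_path_from(t) for t in durations)
```

A critical-path scheduler would give priority to the tasks on this chain, since delaying any of them delays the whole job.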
Example of Interdependent Tasks
[Figure: a five-task DAG scheduled by the scheduler onto two nodes (N1, N2), with the timeline of the resulting execution]
Communication time between two tasks depends on the nodes they run on: approximately 0 if they are at the same node, and n time units if they are at different nodes.
Cluster-based Algorithm
• For general cases, a cluster-based algorithm is used:
– Group tasks into clusters.
– Assign one cluster to each processor.
– Optimal assignment is NP-complete.
– This model is valid for one or multiple jobs.
– Clusters can be determined using:
• Linear methods
• Non-linear methods
• Heuristic/stochastic search
Cluster-based Algorithm
[Figure: a nine-task DAG (A–I) grouped into clusters and assigned to processors, with the resulting execution timeline]
Replication
[Figure: the same DAG scheduled with task C replicated as C1 and C2 on two nodes]
Some tasks are executed in more than one node to avoid extra communication.
Interdependent Task Migration
• Better resource usage can be achieved using migration:
– It can also be planned by static scheduling.
– It is more common in dynamic strategies.
[Figure: a three-task example on two nodes (N1, N2), with and without migration, showing the improved timeline when a task migrates]
Parallel Task Scheduling
• The following aspects should be considered:
– Tasks should be executed in parallel.
– Tasks exchange messages during their execution.
– Local resources (memory, I/O) are required.

Different communication parameters:
• Communication ratio: frequency and amount of data.
• Connection topology: where are messages sent to / received from?
• Communication model: synchronous (tasks wait for data) or asynchronous.

Restrictions:
• The existing physical topology of the network
• Network performance

[Figure: connection topologies — a centralized (master/slave) model with master M and slaves S1–S6, a fully distributed model, a hypercube, and a ring]
Performance of Parallel Tasks
• Parallel task performance depends on:
– Blocking conditions (internal load balancing)
– System availability
– Communication efficiency: latency and bandwidth
[Figure: a timeline of parallel tasks showing running/blocked/idle states around non-blocking sends and receives, blocking sends and receives, and a synchronization barrier]
Parallel Tasks: Heterogeneity
• In some cases connection topologies are not regular:
– Model them with a directed/undirected graph:
• Each node represents a task with its own requirements of memory/disk/CPU.
• Edges represent the amount of information exchanged between a pair of nodes (communication ratio).
• The problem solution is NP-complete.
• Some heuristics can be used:
– E.g., minimum cut: in the case of P nodes, P−1 cut-points are selected (minimizing the information crossing each boundary).
– Result: each partition (node) gets a tightly-coupled group of tasks.
– Asynchronous problems are more complicated.
– There is a problem balancing the load of each node.
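The quantity these partitioning heuristics minimize can be sketched directly: sum the traffic of every edge whose endpoints land on different nodes. The task graph weights and the two candidate assignments below are illustrative, not the figures from the slides.

```python
# Hypothetical task graph: (task, task) -> communication ratio (traffic).
traffic = {("a", "b"): 3, ("a", "c"): 2, ("b", "d"): 8, ("c", "d"): 1}

# Two candidate task-to-node assignments.
assign1 = {"a": "N1", "b": "N1", "c": "N2", "d": "N2"}
assign2 = {"a": "N1", "b": "N2", "c": "N1", "d": "N2"}

def cut_cost(assign):
    # Inter-node communication cost: traffic crossing a node boundary.
    return sum(w for (u, v), w in traffic.items() if assign[u] != assign[v])
```

Here `cut_cost(assign2)` is lower because it keeps the heavy (b, d) edge inside one node, which is exactly the kind of improvement the min-cut heuristic searches for.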
Parallel Tasks: Heterogeneity
[Figure: a task graph partitioned onto three nodes (N1, N2, N3) in two different ways; the first partition has inter-node connection cost 13+17=30, the second 13+15=28]
Tanenbaum. “Distributed Operating Systems” © Prentice Hall 1996
Scheduling Multiple Jobs
• When multiple jobs should be executed, the scheduler:
– Selects the next job from the queue and sends it to the system.
– Considers whether there are available resources (e.g., processors) to execute it.
– Otherwise, it waits until some resources are released.
[Figure: jobs enter a job queue; the scheduler checks “Resources?” and either dispatches the selected job to the system (yes) or waits (no)]
Scheduling Multiple Jobs
• How is the job selected from the queue?
– FCFS (first-come-first-serve): submission order is preserved.
– SJF (shortest-job-first): the smallest job is selected. The size of the job is measured by:
• Resources, e.g., the number of processors, or
• Requested execution time (estimated by the user).
– LJF (longest-job-first): the opposite case.
– Priority-based: the administrator could define some priority criteria, such as:
• Resource cost expenses.
• Number of submitted jobs.
• Job deadline (EDF: earliest-deadline-first).
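The selection policies above reduce to choosing a different key over the same queue. A minimal sketch, with an illustrative queue where job size is measured in requested processors:

```python
# Hypothetical job queue: submission time and requested processors.
jobs = [
    {"name": "j1", "submitted": 0, "processors": 8},
    {"name": "j2", "submitted": 1, "processors": 2},
    {"name": "j3", "submitted": 2, "processors": 16},
]

fcfs = min(jobs, key=lambda j: j["submitted"])    # first-come-first-serve
sjf = min(jobs, key=lambda j: j["processors"])    # shortest-job-first
ljf = max(jobs, key=lambda j: j["processors"])    # longest-job-first
```

A priority-based policy is the same pattern with an administrator-defined key function instead of submission time or size.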
Backfilling
• Backfilling is a variant of any of the previous policies:
– If the selected job cannot execute because there are not enough available resources, then
– Search for another job in the queue that requires fewer resources (thus, it could be executed).
– It increases system usage.
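The backfilling step above can be sketched in a few lines: if the head of the queue does not fit in the free processors, take the first later job that does. Job sizes and names are illustrative.

```python
def backfill(queue, free_cpus):
    # Try the head job first (the policy-selected job).
    head = queue[0]
    if head["cpus"] <= free_cpus:
        return head
    # Head does not fit: backfill a smaller job from further down the queue.
    for job in queue[1:]:
        if job["cpus"] <= free_cpus:
            return job
    return None  # nothing fits; wait for resources to be released

queue = [{"name": "big", "cpus": 32}, {"name": "small", "cpus": 4}]
```

With 8 free CPUs the 32-CPU head job is skipped and the 4-CPU job is backfilled.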
[Figure: when “Resources?” is answered no for the head job, the scheduler searches the queue for jobs that require fewer processors and backfills one of them]
Backfilling with Reservations
• Reservations:
– Calculate when the waiting job could be executed, based on the estimated execution times of the running jobs (deadlines).
– Jobs that require fewer resources are backfilled, but only if they finish before the estimated deadline.
– System usage is not as efficient, but this avoids starvation of large jobs.
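A sketch of the reservation rule: the head job gets a reserved start time, and a later job is backfilled only if its estimated runtime ends before that reservation. All CPU counts, runtimes and the `reserved_start` value are illustrative.

```python
def backfill_with_reservation(queue, free_cpus, reserved_start, now=0):
    head = queue[0]
    if head["cpus"] <= free_cpus:
        return head
    # Backfill only jobs that both fit now AND finish before the
    # reserved start time of the head job (no starvation of large jobs).
    for job in queue[1:]:
        fits = job["cpus"] <= free_cpus
        ends_in_time = now + job["runtime"] <= reserved_start
        if fits and ends_in_time:
            return job
    return None

queue = [
    {"name": "big", "cpus": 32, "runtime": 100},
    {"name": "small", "cpus": 4, "runtime": 10},
    {"name": "slow", "cpus": 4, "runtime": 50},
]
```

With 8 free CPUs and a reservation at t=20, only the 10-unit job is backfilled; the 50-unit job would overrun the reservation even though it fits.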
Backfilling alone may cause large jobs to never be scheduled.
[Figure: backfilling with a reservation — a smaller job is backfilled only if it finishes before the reserved start of the large job]
Dynamic Scheduling
Dynamic Scheduling
• Static scheduling decides whether a job is executed in the system or not, but afterwards there is no monitoring of the executing job.
• Dynamic scheduling:
– Evaluates the system status and takes corrective actions.
– Resolves problems derived from task parallelization (load balancing).
– Reacts to partial system failures (node crashes).
– Allows the system to be shared with other processes.
– Requires a mechanism to monitor the system (task management policies), considering the same resources evaluated by static scheduling.
Load Balancing vs. Load Sharing
• Load sharing:
– Aim: processor states should be the same.
– Idle processors vs. tasks waiting to be run on a different processor.
• Load balancing:
– Aim: processor loads should be the same.
– Processor load changes during task execution.
– How is the load calculated?

They are similar concepts, so LS and LB use similar strategies, but they are activated under different circumstances. LB has its own characteristics.
State/Load Measuring
• What is an idle node?
– Workstation: “several minutes with no keyboard/mouse input and no interactive process running.”
– Calculation node: “no user process has run in a given time frame.”
• What happens when it is no longer idle?
– Nothing → new processes experience bad performance.
– Process migration (complex).
– Keep running under a lower priority.
• If, instead of the state (LS), it is necessary to know the load (LB), new measures are required.
Task Management Policies
All task management decisions are performed using several policies, to be defined for each problem or scenario:
• Information policy: how information is distributed in order to take the other decisions.
• Transfer policy: when a transfer is performed.
• Selection policy: which process is selected to be transferred.
• Location policy: which node the process is transferred to.
Information Policy
When is node information distributed?
– On demand: only when a transfer has to be done.
– Periodically: information is retrieved every sampling window. The information is always available when transferring, but it could be outdated.
– On state change: when the node state has changed.

Distribution scope:
– Complete: all nodes know the complete system information.
– Partial: each node only knows partial system information.
Information Policy
What information is distributed?
– Node load. What does “load” mean? Different parameters:
• %CPU at a given instant.
• Number of processes ready to run (waiting).
• Number of page faults / swapping.
• A combination of several factors.
• In a heterogeneous system, processors have different capabilities (parameters might change).
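A load index combining several factors, as suggested above, can be sketched as a weighted sum. The weights and the choice of factors are illustrative; real systems tune these per scenario.

```python
def node_load(cpu_pct, ready_procs, page_faults_per_s,
              w_cpu=0.5, w_ready=0.3, w_pf=0.2):
    # Combine several load indicators into one scalar; the weights
    # (hypothetical) decide how much each factor matters.
    return (w_cpu * cpu_pct / 100
            + w_ready * ready_procs
            + w_pf * page_faults_per_s)
```

On a heterogeneous system the result would additionally be scaled by each node's relative capacity, so the same index is comparable across nodes.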
Transfer Policy
• Usually, transfer policies are based on a threshold T:
– If node S load > T units, S is a process-sending node.
– If node S load < T units, S is a process-receiving node.
• Transfer decisions:
– Pre-emptive: a partially executed task can be transferred.
• The process state is also transferred (migration).
• Process execution restarts on the target node.
– Non-pre-emptive: running processes cannot be transferred.
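The threshold rule above is trivial to state in code; the threshold value is illustrative:

```python
T = 4.0  # hypothetical load threshold

def role(load, threshold=T):
    # Classify a node by comparing its load against the threshold.
    if load > threshold:
        return "sender"    # overloaded: candidate to send processes away
    if load < threshold:
        return "receiver"  # underloaded: candidate to receive processes
    return "neutral"
```

Practical variants use two thresholds (a high and a low watermark) so that nodes near T do not oscillate between the two roles.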
Selection Policy
• Choosing new processes, not yet under execution (non-pre-emptive approach).
• Selecting processes with minimal transfer costs (small state, minimal usage of local resources).
• Selecting processes only when their completion time will be less on the host node than on the original one (taking transfer time into account):
– Remote execution time should include migration time.
– Execution time can be included in the job description (estimated by the user).
– Otherwise, the system should estimate it.
Location Policy
• Sampling: ask other nodes in order to find the most appropriate one.
• Alternatives:
– No sampling (randomly selected, hot-potato).
– Sequential/parallel sampling.
– Random sampling.
– Closest nodes.
– Broadcast sampling.
– Based on previously gathered information (information policy).
• Three possible policies:
– Sender-driven (push) → the process sender looks for nodes.
– Receiver-driven (pull) → the receiver searches for processes.
– Combined → sender/receiver-driven.
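A sender-driven policy with random sampling, combining the transfer threshold with the location decision, can be sketched as follows. The loads, threshold, and sample size are illustrative.

```python
import random

def find_target(loads, threshold, samples=3, rng=random):
    # Sender-driven (push): sample a few nodes and pick the least
    # loaded one, but only accept it if it is below the threshold.
    candidates = rng.sample(list(loads), k=min(samples, len(loads)))
    best = min(candidates, key=lambda n: loads[n])
    return best if loads[best] < threshold else None

loads = {"n1": 5.0, "n2": 1.0, "n3": 3.0}
```

If no sampled node is below the threshold the sender keeps the process, which avoids hot-potato behavior when the whole system is loaded.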
Remote Process Execution
• How is remote execution performed?
– Create the same execution environment:
• Environment variables, working directory, etc.
– Redirect certain OS calls to the original node:
• E.g., console interaction.
• Migration (pre-emptive) is more complex:
– “Freeze” the process state.
– Transfer it to the new host node.
– “Resume” the process state and execution.
• Several complex issues:
– Message and signal forwarding.
– Copy the swap space or provide a remote page-fault service?
Process Migration
Different migration models:
• Weak migration:
– Restricted to certain applications (executed over virtual machines) or to specific checkpoints.
• Hard migration:
– Native code migration, once the task has been started and at any time during the execution.
– General purpose: more flexible but more complex.
• Data migration:
– No process is migrated; only working data are transferred.
Migration: Task Data
• Data used by the task should also be migrated:
– Data on disk: common filesystem.
– Data in memory: requires “freezing” all data belonging to the task (memory pages and processor registers). Using checkpointing:
• Memory pages are stored on disk.
• It could be more selective if only (some) data pages are stored. This requires the use of special libraries/languages.
• It is also necessary to store messages sent but potentially not yet received.
• Checkpointing also helps if the system fails (crash recovery).
Weak Migration
• Weak migration can be performed by several methods:
– Remote execution only when new processes are created:
• In UNIX this can be done during FORK or EXEC.
• Otherwise, new processes would be executed in the same node, which does not provide load balancing.
– Some state information should be sent even if the task has not been started yet:
• Arguments, environment, open files used by the task, etc.
– Certain libraries allow the programmer to define points at which the state is stored/restored. These points can be used to migrate the process.
– In any case, the executable file should be accessible on the other node:
• Common filesystem.
Weak Migration
In some languages (like Java):
– There is a serialization mechanism that allows the system to transfer object state in a “byte stream format”.
– It provides a low-overhead mechanism for dynamic on-demand class loading from remote nodes.
[Figure: an instance of Process.class with A=3 is serialized on Node 1 and sent to Node 2, which dynamically loads the class on request and restores the instance state]
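The same serialization idea can be sketched in Python with the standard `pickle` module (an analogue of Java serialization, not the mechanism the slides describe): the object's state is flattened to a byte stream that could be sent to another node and restored there.

```python
import pickle

class Process:
    # Minimal stand-in for the Process.class of the figure.
    def __init__(self, a):
        self.a = a

original = Process(a=3)
stream = pickle.dumps(original)   # object state as a byte stream
restored = pickle.loads(stream)   # what the receiving node would do
```

Note that, as in the Java case, the receiving side still needs the class definition available (here, the `Process` class must be importable), which is exactly what dynamic on-demand class loading solves.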
Hard Migration
• Naïve solution:
– Copy the memory map: text, data, stack, ...
– Create a new process control block (PCB) (with all the information stored when the process changes its context).
• There is other data (stored by the kernel) that is required, named the external process state:
– Open files
– Waiting signals
– Sockets
– Semaphores
– Shared memory regions
– ...
Hard Migration
There are different approaches:
• Kernel-based:
– A modified version of the kernel.
– All process information is available.
• User-level:
– Checkpointing libraries.
– Socket mobility protocols.
– System call interception.

Other aspects:
– Unique PIDs in the system.
– Credentials and security issues.
Hard Migration
• One of the objectives is to resume process execution as soon as possible:
– Copy the whole memory space to the new node.
– Copy only the modified pages; the rest of the pages will be provided by the swap area of the original node.
– No previous copy; pages will be provided by the original node as page faults happen:
• Served from memory if they are modified.
• Served from swap if they are not modified.
– Swap out all memory pages in the original node and copy no page: pages will be served from the original swap space.
– Prefetching: start copying pages while the process is already executing.
• Code (read-only) pages do not require migration:
– They are obtained by the remote node via a common filesystem.
Benefits of Process Migration
• Better performance due to load balancing.
• Profit from resource proximity:
– A task that frequently uses a remote resource is migrated to the node where this resource resides.
• Better performance in some client/server applications:
– Minimize data transfer for large volumes of information:
• The server sends code instead of data (e.g., applets).
• The client sends request code (e.g., database access queries).
• Fault tolerance when a partial failure happens.
• Development of “network applications”:
– Applications that are created to be executed on a network.
– Applications explicitly request their own migration.
– Example: mobile agents.
Data Migration
Used in master-slave applications:
– Master: distributes work among the slaves.
– Slave: worker (same code with different data).
• It represents a work distribution algorithm (using data to define tasks):
– Avoid slaves being idle because the master is not providing data.
– Do not schedule too much work to the same slave (the final execution time is defined by the slowest one).
– Solution: dispatch work in blocks (of different sizes).
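One common way to realize the "blocks of different sizes" idea is to shrink the block size as the work runs out, so the last blocks are small and a slow slave cannot delay the whole job by much. The initial block size and shrink factor below are illustrative.

```python
def make_blocks(total_items, first_block, shrink=0.5, min_block=1):
    # Split `total_items` work items into blocks of decreasing size:
    # large blocks first (low dispatch overhead), small blocks at the
    # end (good load balance across slaves of different speeds).
    blocks, size, remaining = [], first_block, total_items
    while remaining > 0:
        take = max(min_block, min(int(size), remaining))
        blocks.append(take)
        remaining -= take
        size *= shrink
    return blocks
```

For example, `make_blocks(100, 40)` starts with a block of 40 items and ends with single-item blocks, while always covering all 100 items.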
Connection Sharing
• Some systems (e.g., web servers) consider workload as the number of incoming requests:
– In this case the requests should be divided among several servers.
– Problem: the server address should be unique.
– Solution: connection sharing:
• DNS forwarding
• IP forwarding (NAT rewriting or encapsulation)
• MAC forwarding
Dynamic vs. Static Scheduling
• Systems might use any of them or even both.
A 2×2 classification (static vs. non-static, dynamic vs. non-dynamic):
• Static + dynamic → Adaptive scheduling: all job submissions are centralized and scheduled, but continuous system monitoring is performed to react to misleading estimations and other unexpected circumstances.
• Static + non-dynamic → Resource manager (batch scheduling): processors are assigned to only one task at a time. The resource manager controls job submissions and keeps a log of the assigned resources.
• Non-static + dynamic → Load-balancing strategies: jobs are executed without restrictions in any node of the system. The system performs, in parallel, load balancing to re-distribute tasks among the different nodes.
• Non-static + non-dynamic → No scheduling service in a cluster of computers: God will provide...
Computing Platforms
• Depending on the preferred use of the platform:
– Autonomous computers from independent users:
• Users share the computer, but only when it is idle.
• What happens when it is not idle anymore?
– Task migration to other nodes.
– Keep executing the task with a lower priority.
– Dedicated system for parallel executions:
• A priori scheduling techniques are possible.
• Alternatively, the behavior of the system can be adapted dynamically.
• Optimizing either application execution time or resource usage.
– General distributed systems (multiple users and multiple applications):
• The goal is to achieve a well-balanced load distribution.
Cluster Taxonomy
• High-performance clusters
• Beowulf; parallel programs; MPI; dedicated facilities
• High-availability clusters
• ServiceGuard, Lifekeeper, Failsafe, heartbeat
• High-throughput clusters
• Workload/resource managers; load balancing; supercomputing services
• Based on application domain:
– Web-service clusters
• LVS/Piranha; balancing TCP connections; replicated data
– Storage clusters
• GFS; parallel filesystems; common data view from all the nodes
– Database clusters
• Oracle Parallel Server