Distributed Operating Systems
Process Scheduling
Fernando Pérez, José María Peña
Víctor Robles, Francisco Rosales
Process Management
1. Concepts and taxonomies: jobs and parallel/distributed systems
2. Static scheduling: scheduling dependent tasks, scheduling parallel tasks, scheduling tasks from multiple jobs
3. Dynamic scheduling: load balancing, process migration, data migration, connection balancing
Starting Scenario: Concepts
• Jobs: Sets of tasks (processes or threads) that require (resources × time)
– Resource: Data, devices, CPU or other (finite) elements required to carry out a job.
– Time: Period when resources are assigned (shared or dedicated) to a certain job.
– Task relationship: Tasks should be performed in an order that keeps the restrictions based on needed inputs or resources.
• Scheduling: Assigning jobs and their tasks to computational resources (especially CPU). Scheduling can be monitored, reviewed and changed over time.
Starting Scenario
[Figure: jobs and their tasks, together with the required resources, are mapped onto nodes (processors)]
GOAL: To assign users’ jobs to the nodes, with the objective of improving sequential performance.
Scheduling Characteristics
• Shared-memory systems
– Any of the processors can access the resources used by a task:
• Memory space
• Internal OS resources (files, connections, etc.)
– Automatic load sharing/balancing: free processors execute any process (task) in ready status.
– Improvements derived from efficient process management:
• Better resource usage and performance
• A parallel application uses the available processors
• Distributed systems
– Tasks are assigned to a processor for their whole running time.
– The resources used by a task can only be accessed from the local processor.
– Load balancing requires process migration.
Starting Scenario: Jobs
What do we execute? Jobs are divided into tasks:
• Independent tasks
– Independent processes
– They could belong to different users
• Cooperating tasks
– They interact somehow
– They belong to the same application
– There can be some dependencies
– Or they can require parallel execution
Cooperating Tasks
Task dependencies
• Model based on a directed acyclic graph (DAG).
• Example: a workflow, where nodes are tasks and edges represent the data transferred between them.

Parallel execution
• It requires a number of tasks executing in parallel at the same moment:
– Synchronous or asynchronous interactions.
– Based on a connection topology.
– Either a master/slave or a fully distributed model.
– Particular communication ratios and message exchanges.
• Example: MPI code
Starting Scenarios: Objectives
What kind of “better performance” are we expecting? A system taxonomy:
• High-availability systems (HAS)
– Service should always be working
– Fault tolerance
• High-performance systems (HPC: High Performance Computing)
– Reaching a higher computational power
– Executing one heavy job in less time
• High-throughput systems (HTS)
– The number of executed jobs should be maximized
– Optimizing the resource usage or the number of clients (these could represent different objectives)
Scheduling
• Scheduling deals with the distribution of tasks on a distributed computing platform:
– Attending to resource requirements
– Attending to task inter-dependencies
• Final performance depends on:
– Concurrency: the maximum number of processors running in parallel.
– Parallelism degree: the smallest granularity into which a parallel job can be divided into tasks.
– Communication costs: they can differ between processors in the same node and processors in different nodes.
– Shared resources: common resources (like memory) shared among all the tasks running in the same node.
Scheduling
• Processor usage:
– Exclusive: one task to one processor.
– Shared: if the tasks perform few I/O phases, performance is limited. It is not the usual strategy.
• Task scheduling can be planned as:
– Static scheduling: the system decides in advance where and when tasks will be executed. These decisions are taken before the actual execution of any of the tasks.
– Dynamic scheduling: once tasks are already assigned, and depending on the behavior of the system, the initial decisions are reviewed and modified. Some of the tasks of the job could already be executing.
Static Scheduling
Static Scheduling
• Generally, it is performed before allowing the job to enter the system.
• The scheduler (sometimes called resource manager) selects a job from a waiting queue (depending on the policy) if there are enough resources; otherwise, the job waits.
[Figure: jobs enter a job queue; the scheduler checks “Resources?” and either dispatches the job to the system (yes) or waits (no)]
Job Descriptions
• In order to decide the order of job executions, the scheduler should have some information about the jobs:
– Number of tasks
– Priority
– Task dependencies (DAG)
– Estimation of the required resources (processors, memory, disk)
– Estimation of the execution time (per task)
– Other execution parameters
– Any applicable restriction
• These definitions are included in a job description file. The format of this file depends on the specific scheduler.
Scheduling Interdependent Tasks
• Considering the following aspects:
– Task (estimated) duration
– Size of the data transmitted after task execution (e.g., a file)
– Task precedence (which tasks must finish before another task starts)
– Restrictions based on specific resource needs
[Figure: a directed acyclic graph (DAG) description of a job, with execution times on the tasks and transmission times on the data edges]
One option is to transform all data into the same measure (time): execution time for tasks, transmission time for data.
Heterogeneous systems make this estimation more difficult: execution time depends on the processor, and communication time depends on the connection.
Scheduling Interdependent Tasks
• Scheduling becomes the assignment of tasks to processors at a given timestamp:
– There are some approximate heuristics for tasks belonging to a single job: critical-path assignment.
– Polynomial-time algorithms exist for 2-processor systems.
– It is an NP-complete problem for N > 2.
– The theoretical model is referred to as multiprocessor scheduling.
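As a minimal sketch of the critical-path idea above: the longest weighted chain through the DAG lower-bounds the makespan of any schedule, no matter how many processors are available. The task durations and edges below are illustrative, not taken from the slides.

```python
from functools import lru_cache

# Hypothetical DAG: task -> execution time, and task -> list of successors.
durations = {"A": 7, "B": 4, "C": 2, "D": 3, "E": 5}
edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}

@lru_cache(maxsize=None)
def longest_path_from(task):
    # Length of the longest dependency chain starting at `task`.
    successors = edges.get(task, [])
    tail = max((longest_path_from(s) for s in successors), default=0)
    return durations[task] + tail

# The critical path of the whole job (here: A -> B -> D -> E = 19).
critical = max(longest_path_from(t) for t in durations)
```

A critical-path scheduler would give priority to the tasks on this chain, since delaying any of them delays the whole job.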
Example of Interdependent Tasks
[Figure: a five-task DAG scheduled by the scheduler onto two nodes (N1, N2), with the timeline of the resulting execution]
Communication time between two tasks depends on the nodes they run on: approximately 0 if they are at the same node, and n time units if they are at different nodes.
Cluster-based Algorithm
• For general cases, a cluster-based algorithm is used:
– Group tasks into clusters.
– Assign one cluster to each processor.
– Optimal assignment is NP-complete.
– This model is valid for one or multiple jobs.
– Clusters can be determined using:
• Linear methods
• Non-linear methods
• Heuristic/stochastic search
Cluster-based Algorithm
[Figure: a nine-task DAG (A–I) grouped into clusters and assigned to processors, with the resulting execution timeline]
Replication
[Figure: the same DAG scheduled with task C replicated as C1 and C2 on two nodes]
Some tasks are executed in more than one node to avoid extra communication.
Interdependent Task Migration
• Better resource usage can be achieved using migration:
– It can also be planned by static scheduling.
– It is more common in dynamic strategies.
[Figure: a three-task example on two nodes (N1, N2), with and without migration, showing the improved timeline when a task migrates]
Parallel Task Scheduling
• The following aspects should be considered:
– Tasks should be executed in parallel.
– Tasks exchange messages during their execution.
– Local resources (memory, I/O) are required.

Different communication parameters:
• Communication ratio: frequency and amount of data.
• Connection topology: where are messages sent to / received from?
• Communication model: synchronous (tasks wait for data) or asynchronous.

Restrictions:
• The existing physical topology of the network
• Network performance

[Figure: connection topologies — a centralized (master/slave) model with master M and slaves S1–S6, a fully distributed model, a hypercube, and a ring]
Performance of Parallel Tasks
• Parallel task performance depends on:
– Blocking conditions (internal load balancing)
– System availability
– Communication efficiency: latency and bandwidth
[Figure: a timeline of parallel tasks showing running/blocked/idle states around non-blocking sends and receives, blocking sends and receives, and a synchronization barrier]
Parallel Tasks: Heterogeneity
• In some cases connection topologies are not regular:
– Model them with a directed/undirected graph:
• Each node represents a task with its own requirements of memory/disk/CPU.
• Edges represent the amount of information exchanged between a pair of nodes (communication ratio).
• The problem solution is NP-complete.
• Some heuristics can be used:
– E.g., minimum cut: in the case of P nodes, P−1 cut-points are selected (minimizing the information crossing each boundary).
– Result: each partition (node) gets a tightly-coupled group of tasks.
– Asynchronous problems are more complicated.
– There is a problem balancing the load of each node.
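The quantity these partitioning heuristics minimize can be sketched directly: sum the traffic of every edge whose endpoints land on different nodes. The task graph weights and the two candidate assignments below are illustrative, not the figures from the slides.

```python
# Hypothetical task graph: (task, task) -> communication ratio (traffic).
traffic = {("a", "b"): 3, ("a", "c"): 2, ("b", "d"): 8, ("c", "d"): 1}

# Two candidate task-to-node assignments.
assign1 = {"a": "N1", "b": "N1", "c": "N2", "d": "N2"}
assign2 = {"a": "N1", "b": "N2", "c": "N1", "d": "N2"}

def cut_cost(assign):
    # Inter-node communication cost: traffic crossing a node boundary.
    return sum(w for (u, v), w in traffic.items() if assign[u] != assign[v])
```

Here `cut_cost(assign2)` is lower because it keeps the heavy (b, d) edge inside one node, which is exactly the kind of improvement the min-cut heuristic searches for.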
Parallel Tasks: Heterogeneity
[Figure: a task graph partitioned onto three nodes (N1, N2, N3) in two different ways; the first partition has inter-node connection cost 13+17=30, the second 13+15=28]
Tanenbaum. “Distributed Operating Systems” © Prentice Hall 1996
Scheduling Multiple Jobs
• When multiple jobs should be executed, the scheduler:
– Selects the next job from the queue and sends it to the system.
– Considers whether there are available resources (e.g., processors) to execute it.
– Otherwise, it waits until some resources are released.
[Figure: jobs enter a job queue; the scheduler checks “Resources?” and either dispatches the selected job to the system (yes) or waits (no)]
Scheduling Multiple Jobs
• How is the job selected from the queue?
– FCFS (first-come-first-serve): submission order is preserved.
– SJF (shortest-job-first): the smallest job is selected. The size of the job is measured by:
• Resources, e.g., the number of processors, or
• Requested execution time (estimated by the user).
– LJF (longest-job-first): the opposite case.
– Priority-based: the administrator could define some priority criteria, such as:
• Resource cost expenses.
• Number of submitted jobs.
• Job deadline (EDF: earliest-deadline-first).
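The selection policies above reduce to choosing a different key over the same queue. A minimal sketch, with an illustrative queue where job size is measured in requested processors:

```python
# Hypothetical job queue: submission time and requested processors.
jobs = [
    {"name": "j1", "submitted": 0, "processors": 8},
    {"name": "j2", "submitted": 1, "processors": 2},
    {"name": "j3", "submitted": 2, "processors": 16},
]

fcfs = min(jobs, key=lambda j: j["submitted"])    # first-come-first-serve
sjf = min(jobs, key=lambda j: j["processors"])    # shortest-job-first
ljf = max(jobs, key=lambda j: j["processors"])    # longest-job-first
```

A priority-based policy is the same pattern with an administrator-defined key function instead of submission time or size.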
Backfilling
• Backfilling is a variant of any of the previous policies:
– If the selected job cannot execute because there are not enough available resources, then
– Search for another job in the queue that requires fewer resources (thus, it could be executed).
– It increases system usage.
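The backfilling step above can be sketched in a few lines: if the head of the queue does not fit in the free processors, take the first later job that does. Job sizes and names are illustrative.

```python
def backfill(queue, free_cpus):
    # Try the head job first (the policy-selected job).
    head = queue[0]
    if head["cpus"] <= free_cpus:
        return head
    # Head does not fit: backfill a smaller job from further down the queue.
    for job in queue[1:]:
        if job["cpus"] <= free_cpus:
            return job
    return None  # nothing fits; wait for resources to be released

queue = [{"name": "big", "cpus": 32}, {"name": "small", "cpus": 4}]
```

With 8 free CPUs the 32-CPU head job is skipped and the 4-CPU job is backfilled.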
[Figure: when “Resources?” is answered no for the head job, the scheduler searches the queue for jobs that require fewer processors and backfills one of them]
Backfilling with Reservations
• Reservations:
– Calculate when the waiting job could be executed, based on the estimated execution times of the running jobs (deadlines).
– Jobs that require fewer resources are backfilled, but only if they finish before the estimated deadline.
– System usage is not as efficient, but this avoids starvation of large jobs.
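A sketch of the reservation rule: the head job gets a reserved start time, and a later job is backfilled only if its estimated runtime ends before that reservation. All CPU counts, runtimes and the `reserved_start` value are illustrative.

```python
def backfill_with_reservation(queue, free_cpus, reserved_start, now=0):
    head = queue[0]
    if head["cpus"] <= free_cpus:
        return head
    # Backfill only jobs that both fit now AND finish before the
    # reserved start time of the head job (no starvation of large jobs).
    for job in queue[1:]:
        fits = job["cpus"] <= free_cpus
        ends_in_time = now + job["runtime"] <= reserved_start
        if fits and ends_in_time:
            return job
    return None

queue = [
    {"name": "big", "cpus": 32, "runtime": 100},
    {"name": "small", "cpus": 4, "runtime": 10},
    {"name": "slow", "cpus": 4, "runtime": 50},
]
```

With 8 free CPUs and a reservation at t=20, only the 10-unit job is backfilled; the 50-unit job would overrun the reservation even though it fits.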
Backfilling alone may cause large jobs to never be scheduled.
[Figure: backfilling with a reservation — a smaller job is backfilled only if it finishes before the reserved start of the large job]
Dynamic Scheduling
Dynamic Scheduling
• Static scheduling decides whether a job is executed in the system or not, but afterwards there is no monitoring of the executing job.
• Dynamic scheduling:
– Evaluates the system status and takes corrective actions.
– Resolves problems derived from task parallelization (load balancing).
– Reacts to partial system failures (node crashes).
– Allows the system to be shared with other processes.
– Requires a mechanism to monitor the system (task management policies), considering the same resources evaluated by static scheduling.
Load Balancing vs. Load Sharing
• Load sharing:
– Aim: processor states should be the same.
– Idle processors vs. tasks waiting to be run on a different processor.
• Load balancing:
– Aim: processor loads should be the same.
– Processor load changes during task execution.
– How is the load calculated?

They are similar concepts, so LS and LB use similar strategies, but they are activated under different circumstances. LB has its own characteristics.
State/Load Measuring
• What is an idle node?
– Workstation: “several minutes with no keyboard/mouse input and no interactive process running.”
– Calculation node: “no user process has run in a given time frame.”
• What happens when it is no longer idle?
– Nothing → new processes experience bad performance.
– Process migration (complex).
– Keep running under a lower priority.
• If, instead of the state (LS), it is necessary to know the load (LB), new measures are required.
Task Management Policies
All task management decisions are performed using several policies, to be defined for each problem or scenario:
• Information policy: how information is distributed in order to take the other decisions.
• Transfer policy: when a transfer is performed.
• Selection policy: which process is selected to be transferred.
• Location policy: which node the process is transferred to.
Information Policy
When is node information distributed?
– On demand: only when a transfer has to be done.
– Periodically: information is retrieved every sampling window. The information is always available when transferring, but it could be outdated.
– On state change: when the node state has changed.

Distribution scope:
– Complete: all nodes know the complete system information.
– Partial: each node only knows partial system information.
Information Policy
What information is distributed?
– Node load. What does “load” mean? Different parameters:
• %CPU at a given instant.
• Number of processes ready to run (waiting).
• Number of page faults / swapping.
• A combination of several factors.
• In a heterogeneous system, processors have different capabilities (parameters might change).
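A load index combining several factors, as suggested above, can be sketched as a weighted sum. The weights and the choice of factors are illustrative; real systems tune these per scenario.

```python
def node_load(cpu_pct, ready_procs, page_faults_per_s,
              w_cpu=0.5, w_ready=0.3, w_pf=0.2):
    # Combine several load indicators into one scalar; the weights
    # (hypothetical) decide how much each factor matters.
    return (w_cpu * cpu_pct / 100
            + w_ready * ready_procs
            + w_pf * page_faults_per_s)
```

On a heterogeneous system the result would additionally be scaled by each node's relative capacity, so the same index is comparable across nodes.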
Transfer Policy
• Usually, transfer policies are based on a threshold T:
– If node S load > T units, S is a process-sending node.
– If node S load < T units, S is a process-receiving node.
• Transfer decisions:
– Pre-emptive: a partially executed task can be transferred.
• The process state is also transferred (migration).
• Process execution restarts on the target node.
– Non-pre-emptive: running processes cannot be transferred.
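The threshold rule above is trivial to state in code; the threshold value is illustrative:

```python
T = 4.0  # hypothetical load threshold

def role(load, threshold=T):
    # Classify a node by comparing its load against the threshold.
    if load > threshold:
        return "sender"    # overloaded: candidate to send processes away
    if load < threshold:
        return "receiver"  # underloaded: candidate to receive processes
    return "neutral"
```

Practical variants use two thresholds (a high and a low watermark) so that nodes near T do not oscillate between the two roles.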
Selection Policy
• Choosing new processes, not yet under execution (non-pre-emptive approach).
• Selecting processes with minimal transfer costs (small state, minimal usage of local resources).
• Selecting processes only when their completion time will be less on the host node than on the original one (taking transfer time into account):
– Remote execution time should include migration time.
– Execution time can be included in the job description (estimated by the user).
– Otherwise, the system should estimate it.
Location Policy
• Sampling: ask other nodes in order to find the most appropriate one.
• Alternatives:
– No sampling (randomly selected, hot-potato).
– Sequential/parallel sampling.
– Random sampling.
– Closest nodes.
– Broadcast sampling.
– Based on previously gathered information (information policy).
• Three possible policies:
– Sender-driven (push) → the process sender looks for nodes.
– Receiver-driven (pull) → the receiver searches for processes.
– Combined → sender/receiver-driven.
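A sender-driven policy with random sampling, combining the transfer threshold with the location decision, can be sketched as follows. The loads, threshold, and sample size are illustrative.

```python
import random

def find_target(loads, threshold, samples=3, rng=random):
    # Sender-driven (push): sample a few nodes and pick the least
    # loaded one, but only accept it if it is below the threshold.
    candidates = rng.sample(list(loads), k=min(samples, len(loads)))
    best = min(candidates, key=lambda n: loads[n])
    return best if loads[best] < threshold else None

loads = {"n1": 5.0, "n2": 1.0, "n3": 3.0}
```

If no sampled node is below the threshold the sender keeps the process, which avoids hot-potato behavior when the whole system is loaded.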
Remote Process Execution
• How is remote execution performed?
– Create the same execution environment:
• Environment variables, working directory, etc.
– Redirect certain OS calls to the original node:
• E.g., console interaction.
• Migration (pre-emptive) is more complex:
– “Freeze” the process state.
– Transfer it to the new host node.
– “Resume” the process state and execution.
• Several complex issues:
– Message and signal forwarding.
– Copy the swap space or provide a remote page-fault service?
Process Migration
Different migration models:
• Weak migration:
– Restricted to certain applications (executed over virtual machines) or to specific checkpoints.
• Hard migration:
– Native code migration, once the task has been started and at any time during the execution.
– General purpose: more flexible but more complex.
• Data migration:
– No process is migrated; only working data are transferred.
Migration: Task Data
• Data used by the task should also be migrated:
– Data on disk: common filesystem.
– Data in memory: requires “freezing” all data belonging to the task (memory pages and processor registers). Using checkpointing:
• Memory pages are stored on disk.
• It could be more selective if only (some) data pages are stored. This requires the use of special libraries/languages.
• It is also necessary to store messages sent but potentially not yet received.
• Checkpointing also helps if the system fails (crash recovery).
Weak Migration
• Weak migration can be performed by several methods:
– Remote execution only when new processes are created:
• In UNIX this can be done during FORK or EXEC.
• Otherwise, new processes would be executed in the same node, which does not provide load balancing.
– Some state information should be sent even if the task has not been started yet:
• Arguments, environment, open files used by the task, etc.
– Certain libraries allow the programmer to define points at which the state is stored/restored. These points can be used to migrate the process.
– In any case, the executable file should be accessible on the other node:
• Common filesystem.
Weak Migration
In some languages (like Java):
– There is a serialization mechanism that allows the system to transfer object state in a “byte stream format”.
– It provides a low-overhead mechanism for dynamic on-demand class loading from remote nodes.
[Figure: an instance of Process.class with A=3 is serialized on Node 1 and sent to Node 2, which dynamically loads the class on request and restores the instance state]
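The same serialization idea can be sketched in Python with the standard `pickle` module (an analogue of Java serialization, not the mechanism the slides describe): the object's state is flattened to a byte stream that could be sent to another node and restored there.

```python
import pickle

class Process:
    # Minimal stand-in for the Process.class of the figure.
    def __init__(self, a):
        self.a = a

original = Process(a=3)
stream = pickle.dumps(original)   # object state as a byte stream
restored = pickle.loads(stream)   # what the receiving node would do
```

Note that, as in the Java case, the receiving side still needs the class definition available (here, the `Process` class must be importable), which is exactly what dynamic on-demand class loading solves.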
Hard Migration
• Naïve solution:
– Copy the memory map: text, data, stack, ...
– Create a new process control block (PCB) (with all the information stored when the process changes its context).
• There is other data (stored by the kernel) that is required, named the external process state:
– Open files
– Waiting signals
– Sockets
– Semaphores
– Shared memory regions
– ...
Hard Migration
There are different approaches:
• Kernel-based:
– A modified version of the kernel.
– All process information is available.
• User-level:
– Checkpointing libraries.
– Socket mobility protocols.
– System call interception.

Other aspects:
– Unique PIDs in the system.
– Credentials and security issues.
Hard Migration
• One of the objectives is to resume process execution as soon as possible:
– Copy the whole memory space to the new node.
– Copy only the modified pages; the rest of the pages will be provided by the swap area of the original node.
– No previous copy; pages will be provided by the original node as page faults happen:
• Served from memory if they are modified.
• Served from swap if they are not modified.
– Swap out all memory pages in the original node and copy no page: pages will be served from the original swap space.
– Prefetching: start copying pages while the process is already executing.
• Code (read-only) pages do not require migration:
– They are obtained by the remote node via a common filesystem.
Benefits of Process Migration
• Better performance due to load balancing.
• Profit from resource proximity:
– A task that frequently uses a remote resource is migrated to the node where this resource resides.
• Better performance in some client/server applications:
– Minimize data transfer for large volumes of information:
• The server sends code instead of data (e.g., applets).
• The client sends request code (e.g., database access queries).
• Fault tolerance when a partial failure happens.
• Development of “network applications”:
– Applications that are created to be executed on a network.
– Applications explicitly request their own migration.
– Example: mobile agents.
Data Migration
Used in master-slave applications:
– Master: distributes work among the slaves.
– Slave: worker (same code with different data).
• It represents a work distribution algorithm (using data to define tasks):
– Avoid slaves being idle because the master is not providing data.
– Do not schedule too much work to the same slave (the final execution time is defined by the slowest one).
– Solution: dispatch work in blocks (of different sizes).
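One common way to realize the "blocks of different sizes" idea is to shrink the block size as the work runs out, so the last blocks are small and a slow slave cannot delay the whole job by much. The initial block size and shrink factor below are illustrative.

```python
def make_blocks(total_items, first_block, shrink=0.5, min_block=1):
    # Split `total_items` work items into blocks of decreasing size:
    # large blocks first (low dispatch overhead), small blocks at the
    # end (good load balance across slaves of different speeds).
    blocks, size, remaining = [], first_block, total_items
    while remaining > 0:
        take = max(min_block, min(int(size), remaining))
        blocks.append(take)
        remaining -= take
        size *= shrink
    return blocks
```

For example, `make_blocks(100, 40)` starts with a block of 40 items and ends with single-item blocks, while always covering all 100 items.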
Connection Sharing
• Some systems (e.g., web servers) consider workload as the number of incoming requests:
– In this case the requests should be divided among several servers.
– Problem: the server address should be unique.
– Solution: connection sharing:
• DNS forwarding
• IP forwarding (NAT rewriting or encapsulation)
• MAC forwarding
Dynamic vs. Static Scheduling
• Systems might use any of them or even both.
A 2×2 classification (static vs. non-static, dynamic vs. non-dynamic):
• Static + dynamic → Adaptive scheduling: all job submissions are centralized and scheduled, but continuous system monitoring is performed to react to misleading estimations and other unexpected circumstances.
• Static + non-dynamic → Resource manager (batch scheduling): processors are assigned to only one task at a time. The resource manager controls job submissions and keeps a log of the assigned resources.
• Non-static + dynamic → Load-balancing strategies: jobs are executed without restrictions in any node of the system. The system performs, in parallel, load balancing to re-distribute tasks among the different nodes.
• Non-static + non-dynamic → No scheduling service in a cluster of computers: God will provide...
Computing Platforms
• Depending on the preferred use of the platform:
– Autonomous computers from independent users:
• Users share the computer, but only when it is idle.
• What happens when it is not idle anymore?
– Task migration to other nodes.
– Keep executing the task with a lower priority.
– Dedicated system for parallel executions:
• A priori scheduling techniques are possible.
• Alternatively, the behavior of the system can be adapted dynamically.
• Optimizing either application execution time or resource usage.
– General distributed systems (multiple users and multiple applications):
• The goal is to achieve a well-balanced load distribution.
Cluster Taxonomy
• High-performance clusters
• Beowulf; parallel programs; MPI; dedicated facilities
• High-availability clusters
• ServiceGuard, Lifekeeper, Failsafe, heartbeat
• High-throughput clusters
• Workload/resource managers; load balancing; supercomputing services
• Based on application domain:
– Web-service clusters
• LVS/Piranha; balancing TCP connections; replicated data
– Storage clusters
• GFS; parallel filesystems; common data view from all the nodes
– Database clusters
• Oracle Parallel Server