
Distributed Scheduling In Sombrero, A Single Address Space Distributed Operating System

Milind Patil

Contents

• Distributed Scheduling
• Features of Sombrero
• Goals
• Related Work
• Platform for Distributed Scheduling
• Distributed Scheduling Algorithm (Simulation)
• Scaling of the Algorithm (Simulation)
• Initiation of Porting to Sombrero Prototype
• Testing
• Conclusion
• Future Work

Distributed Scheduling

• A distributed scheduling algorithm provides for sharing as well as better usage of resources across the system.

• The algorithm will allow threads in the distributed system to be scheduled among the different processors in such a manner that CPU usage is balanced.

Features of Sombrero

Distributed scheduling in Sombrero takes advantage of the distributed SASOS features:

• The shared memory inherent to a distributed SASOS provides an excellent mechanism to distribute load information of the nodes in the system (information policy).

• The ability of threads to migrate in a simple manner across machines has a potentially far-reaching effect on the performance of the distributed scheduling mechanism.

Features of Sombrero (contd.)

• The granularity of migration is a thread, not a process. This allows the distributed scheduling algorithm to have a flexible selection policy (which determines the thread to be transferred to achieve load balancing).

• This feature also reduces the software complexity of the algorithm.

Goals

• Platform for Distributed Scheduling

• Simulation of Distributed Scheduling Algorithm

• Scaling of the Algorithm (Simulation)

• Initiation of Porting to Sombrero Prototype

Related Work

Load-balancing algorithms for:

• Sprite
• PVM
• Condor
• UNIX

Requirements

• A working prototype of Sombrero is needed that has the ability to manage extremely large data sets across a network in a distributed single address space.

• A functional prototype is needed that implements essential features such as protection domains, Sombrero thread support, token tracking support, etc.

The prototype is under construction and not yet available as a development platform. Windows NT is used instead, since the prototype is being developed on it.

Sombrero Node

[Figure: Architecture of Sombrero Nodes. Each Sombrero node contains local thread information, a selection policy, a communication thread and a distributed scheduler; nodes consult a shared load table and exchange work through thread migration.]

Sombrero Clusters

[Figure: Sombrero nodes (RMOCBs 0x1000-0x7000) organized into Clusters I-III and connected through Routers 0x1 and 0x11; each cluster has its own load table (load tables at 0x1000, 0x2000 and 0x5000).]

The Sombrero system is organized into hierarchies of clusters for scalable distributed scheduling.

Sombrero Router

[Figure: Architecture of Sombrero Routers. A set of sockets feeds a single I/O completion port, which is drained by a pool of service threads.]
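The slide names the router's components but not their code. As an illustration only, here is a minimal sketch of the standard Windows NT I/O completion port pattern this architecture suggests; the thread body and the way sockets would be attached are assumptions, not the Sombrero prototype's actual code.

```cpp
// A minimal sketch of the router's service-thread pattern on Windows NT
// (assumed from the slide; not the actual Sombrero router code).
#include <windows.h>
#include <vector>

static DWORD WINAPI ServiceThread(LPVOID port) {
    for (;;) {
        DWORD bytes = 0;
        ULONG_PTR key = 0;            // completion key identifies the socket
        LPOVERLAPPED ov = nullptr;
        // Each service thread blocks here until some socket completes an I/O.
        if (GetQueuedCompletionStatus((HANDLE)port, &bytes, &key, &ov, INFINITE)) {
            // ... decode the Sombrero message and route it toward its cluster ...
        }
    }
}

int main() {
    // One completion port shared by all of the router's sockets.
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, nullptr, 0, 0);

    // A small pool of service threads drains the port.
    std::vector<HANDLE> pool;
    for (int i = 0; i < 4; ++i)
        pool.push_back(CreateThread(nullptr, 0, ServiceThread, port, 0, nullptr));

    // Each accepted socket would be attached to the port with
    //   CreateIoCompletionPort((HANDLE)sock, port, (ULONG_PTR)sock, 0);
    WaitForMultipleObjects((DWORD)pool.size(), pool.data(), TRUE, INFINITE);
    return 0;
}
```

The completion port lets a small, fixed pool of service threads handle completions from any number of router sockets.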

Inter-node Communication

Sombrero nodes communicate with each other through the routers.

[Figure: nodes with RMOCBs 0x1000, 0x2000 and 0x3000 in Clusters I and II communicating through Routers 0x1 and 0x11.]

Router Tables

Router 0x1:

RMOCB                SUBNET MASK          REMOTE IP ADDRESS
0x0000000000000000   0xC000000000000000   NULL   (node A, local)
0x8000000000000000   0x8000000000000000   NULL   (node B, local)

Reachable via Router 0x3 (remote IP: xx):
RMOCB                SUBNET MASK
0x4000000000000000   0xC000000000000000

[Figure: nodes A and B attached to Router R1; nodes C and D attached to Router R3.]

Router Tables (contd.)

Router 0x3:

RMOCB                SUBNET MASK          REMOTE IP ADDRESS
0x4000000000000000   0xE000000000000000   NULL   (node C, local)
0x6000000000000000   0xE000000000000000   NULL   (node D, local)

Reachable via Router 0x1 (remote IP: xx):
RMOCB                SUBNET MASK
0x0000000000000000   0xC000000000000000
0x8000000000000000   0x8000000000000000

[Figure: nodes C and D attached to Router R3; nodes A and B attached to Router R1.]
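The tables above suggest mask-based matching: an address belongs to an entry when the address ANDed with the subnet mask equals the entry's RMOCB, and a NULL remote IP means the region is local to this router. The sketch below illustrates such a lookup; the type and function names are illustrative and not taken from the Sombrero sources.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Illustrative route entry; field names follow the table columns above.
struct RouteEntry {
    uint64_t rmocb;                       // base address of the region
    uint64_t mask;                        // subnet mask covering the region
    std::optional<std::string> remoteIp;  // empty (NULL) => region is local
};

// Find the entry responsible for a 64-bit single-address-space address,
// preferring the most specific mask (an assumption borrowed from IP routing).
std::optional<RouteEntry> Lookup(const std::vector<RouteEntry>& table, uint64_t addr) {
    std::optional<RouteEntry> best;
    for (const RouteEntry& e : table)
        if ((addr & e.mask) == e.rmocb && (!best || e.mask > best->mask))
            best = e;
    return best;
}
```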

Address Space Allocation

This project implements an address space allocation mechanism to distribute the 2^64-byte address space amongst the nodes in the system.

Example: consider a system of four Sombrero nodes (A, B, C and D). The nodes come online for the very first time in the order A, B, C, D.

[Figure: nodes A and B attached to Router R1; nodes C and D attached to Router R3.]

Address Space Allocation (contd.)

• The address space allocated when A is initialized:
A: 0x0000000000000000 – 0xffffffffffffffff

• The address space allocated when B is initialized:
A: 0x0000000000000000 – 0x7fffffffffffffff
B: 0x8000000000000000 – 0xffffffffffffffff

Address Space Allocation (contd.)

• The address space allocated when C is initialized:
A: 0x0000000000000000 – 0x3fffffffffffffff
B: 0x8000000000000000 – 0xffffffffffffffff
C: 0x4000000000000000 – 0x7fffffffffffffff

• The address space allocated when D is initialized:
A: 0x0000000000000000 – 0x3fffffffffffffff
B: 0x8000000000000000 – 0xffffffffffffffff
C: 0x4000000000000000 – 0x5fffffffffffffff
D: 0x6000000000000000 – 0x7fffffffffffffff
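The pattern in this example is that each newly initialized node receives the upper half of an existing node's range. The sketch below reproduces that halving rule; how the donor node is chosen is not specified on these slides, so it is passed in explicitly here, and the code is illustrative rather than the actual Sombrero allocator.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Illustrative sketch of the halving allocation shown above (not the actual
// Sombrero allocator). Each node owns an inclusive range [base, last]; a new
// node takes the upper half of a donor node's range.
struct Range { uint64_t base; uint64_t last; };

std::map<std::string, Range> nodes;

void AddNode(const std::string& name, const std::string& donor) {
    if (nodes.empty()) {                         // first node owns all 2^64 bytes
        nodes[name] = {0, UINT64_MAX};
        return;
    }
    Range& d = nodes.at(donor);
    uint64_t mid = d.base + (d.last - d.base) / 2 + 1;
    nodes[name] = {mid, d.last};                 // new node: upper half
    d.last = mid - 1;                            // donor keeps the lower half
}

int main() {
    AddNode("A", "");   // A: 0x0000000000000000 - 0xffffffffffffffff
    AddNode("B", "A");  // A keeps the lower half, B gets 0x8000... - 0xffff...
    AddNode("C", "A");  // C gets 0x4000... - 0x7fff...
    AddNode("D", "C");  // D gets 0x6000... - 0x7fff..., C keeps 0x4000... - 0x5fff...
    for (const auto& [n, r] : nodes)
        std::printf("%s: 0x%016llx - 0x%016llx\n", n.c_str(),
                    (unsigned long long)r.base, (unsigned long long)r.last);
}
```

Running this reproduces the ranges listed above for A, B, C and D.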

Load Measurement

A node's workload can be estimated based on some measurable parameters:

• Total number of threads on the node at the time of load measurement.
• Instruction mixes of these threads (I/O bound or CPU bound).

Load Measurement (contd.)

Work Load = Σᵢ (pᵢ × fᵢ)

where pᵢ is the processor utilization of thread i and fᵢ is a heuristic factor that adjusts the importance of the thread depending on how it is being used.

The heuristic factor f should have a large value for I/O-intensive threads and a small value for CPU-intensive threads. The values of the heuristic factor can be determined empirically using a fully functional Sombrero prototype.
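With these definitions, the workload is just a weighted sum over the node's threads. A minimal sketch, with illustrative types:

```cpp
#include <vector>

// Illustrative thread descriptor: p is the thread's processor utilization,
// f its heuristic factor (large for I/O-bound, small for CPU-bound threads).
struct ThreadInfo { double p; double f; };

// Work Load = sum over all threads i of (p_i * f_i)
double WorkLoad(const std::vector<ThreadInfo>& threads) {
    double load = 0.0;
    for (const ThreadInfo& t : threads)
        load += t.p * t.f;
    return load;
}
```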

Load Measurement - Simulation

• In the simulation we assume that the processor utilization of all threads is the same; this is sufficient to demonstrate the correctness of the algorithm.
• The measure of load at the node level is the number of Sombrero threads.
• A threshold policy has been defined (sketched in code below):

high: number of Sombrero threads ≥ HIGHLOAD
low: number of Sombrero threads < MEDIUMLOAD
medium: MEDIUMLOAD ≤ number of Sombrero threads < HIGHLOAD
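A minimal sketch of this threshold policy, using the thread count as the load measure; MEDIUMLOAD and HIGHLOAD are the slide's tunable thresholds, everything else is illustrative:

```cpp
// Illustrative threshold policy from the simulation: the node's load level is
// derived from its Sombrero thread count.  MEDIUMLOAD and HIGHLOAD are the
// tunable thresholds named above.
enum class LoadLevel { Low, Medium, High };

LoadLevel Classify(int threadCount, int mediumLoad, int highLoad) {
    if (threadCount >= highLoad)  return LoadLevel::High;    // >= HIGHLOAD
    if (threadCount < mediumLoad) return LoadLevel::Low;     // <  MEDIUMLOAD
    return LoadLevel::Medium;     // MEDIUMLOAD <= count < HIGHLOAD
}
```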

Load Tables

• Shared memory is used to distribute load information. (In Sombrero the shared memory consistency is managed by the token tracking mechanism)

• One load table is needed for each cluster.

• Thresholds of load have been established to minimize the exchange of load information in the network. Only threshold crossings are recorded in the load table.
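One way to realize "only threshold crossings are recorded" is to publish a node's level only when its classification changes. The sketch below assumes that interpretation; in the real system the table would live in the shared single address space with token-tracked consistency.

```cpp
#include <cstdint>
#include <map>

// Illustrative per-cluster load table: node id -> last published load level.
// In Sombrero the table would live in the shared single address space, with
// consistency handled by the token tracking mechanism.
enum class LoadLevel { Low, Medium, High };   // as in the previous sketch

std::map<uint64_t, LoadLevel> loadTable;

// Publish a node's load level only when it crosses a threshold, so that the
// load table (and the network) sees few updates.
void ReportLoad(uint64_t nodeId, LoadLevel newLevel) {
    auto it = loadTable.find(nodeId);
    if (it != loadTable.end() && it->second == newLevel)
        return;                               // no threshold crossing: no update
    loadTable[nodeId] = newLevel;             // record the crossing
}
```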

Distributed Scheduling Algorithm

• Highly loaded nodes in the minority → sender-initiated algorithm.
• Lightly loaded nodes in the minority → receiver-initiated algorithm.
• Medium loaded nodes are not considered.

Distributed Scheduling Algorithm

The algorithm used is dynamic, i.e. sender-initiated at lower loads and receiver-initiated at higher loads.

1. Nodes loaded in the medium range do not participate in load balancing.

2. Load balancing is not done if the node belongs to the majority (the larger of the groups of highly or lightly loaded nodes).

Distributed Scheduling Algorithm

3. Load balancing is done if the node belongs to the minority (the smaller of the groups of highly or lightly loaded nodes); see the sketch below.

• If the node is heavily loaded, the algorithm is sender-initiated: a lightly loaded node is chosen at random and the RGETTHREADS message protocol is followed for thread migration.
• If the node is lightly loaded, the algorithm is receiver-initiated: a highly loaded node is chosen at random and the GETTHREADS message protocol is followed for thread migration.
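A minimal sketch of the per-node decision in steps 1-3, assuming the node can read the counts of highly and lightly loaded nodes from its cluster's load table. The protocol names come from the slide; how a tie between equally sized groups is resolved is not stated, so the sketch lets both sides act in that case.

```cpp
// Illustrative decision logic for one node, following steps 1-3 above.
// The counts of highly and lightly loaded nodes come from the cluster's
// load table; GETTHREADS/RGETTHREADS are the protocols named on the slide.
enum class LoadLevel { Low, Medium, High };
enum class Action { None, SendThreads, RequestThreads };

Action Decide(LoadLevel self, int highlyLoaded, int lightlyLoaded) {
    if (self == LoadLevel::Medium)
        return Action::None;                               // step 1
    if (self == LoadLevel::High) {
        if (highlyLoaded > lightlyLoaded)
            return Action::None;                           // step 2: in the majority
        // Step 3, sender-initiated: pick a lightly loaded node at random and
        // follow the RGETTHREADS message protocol to migrate threads to it.
        return Action::SendThreads;
    }
    // self == LoadLevel::Low
    if (lightlyLoaded > highlyLoaded)
        return Action::None;                               // step 2: in the majority
    // Step 3, receiver-initiated: pick a highly loaded node at random and
    // follow the GETTHREADS message protocol to migrate threads from it.
    return Action::RequestThreads;
}
```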

Scaling the Algorithm

• Aggregating the clusters provides scalability.
• Thresholds for clusters are defined as follows:

high: no cluster member is lightly loaded and at least one member is highly loaded
low: no cluster member is highly loaded and at least one member is lightly loaded
medium: all other cases, i.e. load balancing can occur within the cluster's members or all members of the cluster are medium loaded

Scaling the Algorithm

1. At any cluster level, only the nodes belonging to the minority group at that level are active.

2. Load balancing at an nth-level cluster is attempted after every (n × SOMECONSTANT) unsuccessful attempts at the node level (see the sketch below).

3. A suitable nth-level target cluster is found through the corresponding load table, and the TRANSFERREQUEST message protocol is followed for thread migration.
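One reading of the escalation rule in step 2, sketched below: a node involves its nth-level cluster only after n × SOMECONSTANT unsuccessful node-level attempts. SOMECONSTANT is the slide's own unspecified constant; the interpretation and the value used here are assumptions.

```cpp
// Illustrative reading of step 2: escalate to the n-th level cluster only
// after n * SOMECONSTANT unsuccessful node-level attempts.  SOMECONSTANT is
// the slide's own (unspecified) tunable; the value 4 here is arbitrary.
constexpr int SOMECONSTANT = 4;

bool ShouldTryClusterLevel(int level, int unsuccessfulNodeAttempts) {
    return unsuccessfulNodeAttempts > 0 &&
           unsuccessfulNodeAttempts % (level * SOMECONSTANT) == 0;
}
```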

[Figure: hierarchy of clusters at levels n = 1, 2, 3.]

Testing Eight Nodes

Table 1. Testing Eight Nodes

Cluster (Before Load Balancing)   Cluster (After Load Balancing)   Messages O(n)
[1,0,7]                           [0,1,7]                           3
[2,0,6]                           [0,2,6]                           6
[3,0,5]                           [0,3,5]                           9
[4,0,4]                           [0,4,4]                          12
[5,0,3]                           [0,5,3]                          10
[6,0,2]                           [0,6,2]                          12
[7,0,1]                           [0,7,1]                          14

Cluster: [# of highly loaded nodes, # of medium loaded nodes, # of lightly loaded nodes]

Testing Three Clusters

Table 5. Testing Three Clusters

Load       Before Load Balancing               After Load Balancing                Messages O(n)
           Cluster I  Cluster II  Cluster III  Cluster I  Cluster II  Cluster III
{L,L,L}    [2,0,6]    [2,0,6]     [2,0,6]      [0,2,6]    [0,2,6]     [0,2,6]      18
{L,L,M}    [2,0,6]    [2,0,6]     [0,8,0]      [0,2,6]    [0,2,6]     [0,8,0]      12
{L,M,M}    [2,0,6]    [0,8,0]     [0,8,0]      [0,2,6]    [0,8,0]     [0,8,0]       6
{L,L,H}    [0,0,8]    [0,0,8]     [4,4,0]      [0,0,8]    [0,0,8]     [0,8,0]      14
{L,M,H}    [0,0,8]    [0,8,0]     [1,7,0]      [0,0,8]    [0,8,0]     [0,8,0]       5
{L,H,H}    [0,0,8]    [8,0,0]     [1,7,0]      [0,0,8]    [0,8,0]     [0,8,0]      31
{M,H,H}    [0,8,0]    [8,0,0]     [8,0,0]      [0,8,0]    [8,0,0]     [8,0,0]       -

[Figure: two-level hierarchy - three first-level clusters (n=1) under one second-level cluster (n=2).]

Testing Six Clusters at Two Levels

[Figure: three-level hierarchy (n = 1, 2, 3) with second-level clusters A and B, containing first-level clusters I-III and IV-VI respectively.]

Table 6. Testing Six Clusters at Two Levels
(Each case shows the state of first-level clusters I-VI before and after load balancing; H/M/L is the cluster-level load.)

Before: I [1,7,0] H   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
After:  I [0,8,0] M   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
Messages O(n): 5

Before: I [2,6,0] H   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
After:  I [0,8,0] M   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
Messages O(n): 8

Before: I [6,2,0] H   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
After:  I [0,8,0] M   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
Messages O(n): 20

Before: I [7,1,0] H   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
After:  I [0,8,0] M   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
Messages O(n): 23

Before: I [8,0,0] H   II [8,0,0] H   III [2,6,0] H   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
After:  I [0,8,0] M   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
Messages O(n): 60

Before: I [8,0,0] H   II [8,0,0] H   III [3,5,0] H   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
After:  I [0,8,0] M   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
Messages O(n): 63

Before: I [8,0,0] H   II [8,0,0] H   III [8,0,0] H   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
After:  I [0,8,0] M   II [0,8,0] M   III [0,8,0] M   IV [0,0,8] L   V [0,8,0] M   VI [0,8,0] M
Messages O(n): 81

Conclusion

• The testing of distributed scheduling using the simulator verifies that the algorithm functions correctly.

• It is observed that the increase in the number of messages is proportional to the increase in the number of heavily loaded nodes.

• The number of messages required for load balancing at the first level and above is the same if the ratio of heavily to lightly loaded nodes is kept constant at both levels.

Conclusion (contd.)

• Only one additional load table is required per additional cluster. Hence, the required number of messages is expected to increase by only a small constant factor as the level of clustering increases.

• It can be concluded that the algorithm's complexity is O(n), where n is the number of highly loaded nodes.

Future Work

• Port the Sombrero node's communication code from Windows NT to Sombrero.

• Change the load measurement to the more general workload formula.

• Reuse code from the Sombrero router.

• Develop an adaptive cluster-forming algorithm.

Acknowledgements

Dr. Donald Miller

Dr. Rida Bazzi

Dr. Bruce Millard

Mr. Alan Skousen

Mr. Raghavendra Hebbalalu

Mr. Ravikanth Nasika

Mr. Tom Boyd
