Cluster Operating System Support For Parallel Autonomic Computing
Andrzej M. Goscinski, J. Silcock, M. Hobbs
School of Information Technology
Deakin University
Geelong, Vic 3217, Australia
June 2004, COSET’2004
A Need for More than Execution Performance
Performance is a critical assessment criterion
Security, reliability, and ease of programming are neglected
Furthermore
– Parallel computers are seen as being user unfriendly
– Parallel processing is not used on a daily basis
– Ordinary users have to be involved in programming activities that are of the operating system nature
– Ordinary engineers, managers, etc. do not have, and should not have, the specialized knowledge needed to program operating system oriented activities
Aim of Our Research
IBM has launched a comprehensive program
– “to re-examine an obsession with faster, smaller, and more powerful”
– “to look at the evolution of computing from a more holistic perspective”
IBM’s Autonomic Computing - one of the Grand Challenges
Parallel processing on non-dedicated clusters could benefit from the Autonomic Computing vision
Aim: to show a general design of services and an initial implementation of a system that moves parallel processing on clusters to the computing mainstream using the Autonomic Computing vision
IBM’s Autonomic Computing
The name “autonomic” has not caught on everywhere, if only because it’s IBM’s
– Microsoft – “trustworthy”
– Others prefer the more generic “self-managing”
Many see “autonomic computing” as one of the basic parts of a revolutionary technology that
– Will start the new .com boom
– Will move parallel computing on clusters to the computing mainstream
IBM’s Autonomic Computing
Characteristics of an autonomic computing system
– knows itself
– configures and reconfigures itself under varying and unpredictable conditions
– optimizes its working
– performs something akin to healing
– provides self-protection
– knows its surrounding environment
– exists in an open (non-hermetic) environment
– anticipates the optimized resources needed while keeping its complexity hidden
Related Work
A number of projects related to Autonomic Computing are mentioned on the IBM website
While many of the reported projects engage in some aspects of Autonomic Computing, none engages in research to develop a system that has all eight of the required characteristics
None of the projects addresses parallel processing, in particular parallel processing on non-dedicated clusters
Design of Autonomic Elements (Services) Providing Autonomic Computing on Non-dedicated Clusters
We have proposed and designed a set of autonomic elements that must be provided to develop an autonomic computing environment on a non-dedicated cluster
Three component levels
– Services
– Computers
– Non-dedicated cluster
Note: we have not addressed
– Hardware aspects
– Administration aspects
Cluster Knows Itself
A need for resource discovery
This autonomic element runs on each computer
Activities
– Acquires knowledge of static parameters of computers
  processor type (e.g., speed)
  memory size
  available software
– Acquires knowledge of dynamic parameters of clusters
  computers’ load
  available memory
  communication pattern and volume
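The split between static and dynamic parameters can be illustrated with a minimal Python sketch. It is not the Holos Resource Discovery Server; the names `StaticInfo`, `DynamicInfo`, `static_info`, and `dynamic_info` are hypothetical, and only values available from the Python standard library are sampled. Memory size, available software, and communication pattern are left out because the standard library has no portable call for them.

```python
import os
import platform
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class StaticInfo:
    """Parameters that do not change while the computer is running."""
    processor: str
    cpu_count: int
    system: str


@dataclass
class DynamicInfo:
    """Parameters that have to be sampled repeatedly."""
    timestamp: float
    load_1min: Optional[float]  # None where no load average is available


def static_info() -> StaticInfo:
    # Collected once, e.g. when the discovery element starts.
    return StaticInfo(
        processor=platform.machine(),
        cpu_count=os.cpu_count() or 1,
        system=platform.system(),
    )


def dynamic_info() -> DynamicInfo:
    # os.getloadavg() exists only on Unix-like systems.
    try:
        load = os.getloadavg()[0]
    except (AttributeError, OSError):
        load = None
    return DynamicInfo(timestamp=time.time(), load_1min=load)
```

A discovery element running on each computer could call `static_info()` once and `dynamic_info()` periodically, then pass the samples to whichever service decides on placement.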
Resource Discovery Service Design
[Figure: Computer i and Computer j each run a Resource Discovery element next to their CPU and main memory. Resource Discovery gathers computational load and parameters, together with communication pattern and load (local communication load and remote communication load), for computation elements 1 and 2.]
Cluster Configures and Reconfigures Itself under Varying and Unpredictable Conditions
In a non-dedicated cluster there are times when
– Some computers are lightly loaded or idle
– Some computers cannot be used
  their owners have removed them from the shared pool of resources
  they are heavily loaded
To offer high availability, i.e., to configure and reconfigure itself, the system
– Forms parallel virtual clusters adaptively and dynamically
– Bases the forming on load and changing resources
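As a rough illustration of forming a virtual cluster from load and availability, the following Python sketch filters a pool of computers. The `Computer` record, the `max_load` threshold, and the function name `form_virtual_cluster` are assumptions made for the example and are not taken from Holos.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Computer:
    name: str
    load: float   # current load, 0.0 (idle) to 1.0 (fully busy)
    shared: bool  # False if the owner removed it from the shared pool


def form_virtual_cluster(computers, max_load=0.5, size=None):
    """Return the names of usable computers, least loaded first.

    A computer is usable if its owner shares it and its load does not
    exceed max_load. If size is given, at most that many are returned.
    """
    usable = [c for c in computers if c.shared and c.load <= max_load]
    usable.sort(key=lambda c: c.load)
    chosen = usable if size is None else usable[:size]
    return [c.name for c in chosen]
```

Calling the function again with fresh load data yields a new virtual cluster, which is one simple way to picture the cluster changing between times t0, t1, t2, and so on.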
Availability Service Design
[Figure: the Availability Services use the Resource Discovery (RD) elements running on the computers to form a Virtual Parallel Cluster at each of the times t0, t1, t2, and t3, where t0 < t1 < t2 < t3.]
Cluster Should Optimize Its Working
Application computation elements should be placed optimally
To improve performance, placement needs to take into account
– Computation load
– Available memory
– Communication costs
To optimize the cluster’s working there is
– Static allocation and load balancing
– Ability to change performance indices that reflect user objectives
– Computation element migration, creation and duplication
– Setting of computation priorities of applications
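Static allocation and a migration decision can be sketched in a few lines of Python. This is a simple greedy heuristic chosen for illustration, not the Holos Global Scheduler: it considers computation load and available memory only and ignores communication costs. The function names, the dictionary layout, and the `threshold` parameter are hypothetical.

```python
def allocate(processes, computers):
    """Greedy static allocation.

    processes: dict name -> (cpu_demand, memory_needed)
    computers: dict name -> {"load": float, "free_mem": int}
    Returns dict process -> computer and updates `computers` in place.
    """
    placement = {}
    # Place the most CPU-demanding processes first.
    for proc, (cpu, mem) in sorted(processes.items(), key=lambda kv: -kv[1][0]):
        candidates = [c for c, s in computers.items() if s["free_mem"] >= mem]
        if not candidates:
            raise RuntimeError(f"no computer has enough memory for {proc}")
        # Choose the least loaded computer that has enough memory.
        target = min(candidates, key=lambda c: computers[c]["load"])
        computers[target]["load"] += cpu
        computers[target]["free_mem"] -= mem
        placement[proc] = target
    return placement


def pick_migration(computers, threshold=0.5):
    """Return (source, destination) if the load gap exceeds threshold, else None."""
    busiest = max(computers, key=lambda c: computers[c]["load"])
    idlest = min(computers, key=lambda c: computers[c]["load"])
    if computers[busiest]["load"] - computers[idlest]["load"] > threshold:
        return busiest, idlest
    return None
```

`allocate` covers the "where" question at start-up, while `pick_migration` hints at the "where, which, when" decision of dynamic load balancing; choosing which process to move is left out of the sketch.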
High Performance Service Design
[Figure: a Virtual Parallel Cluster of computers C1, C2, C3, ..., Cn, formed by the Availability Services. The Global Scheduler combines Static Allocation, which decides where processes are placed {P1 → C1, P2 → C2, ..., {Pi, Pj} → Cn}, with Load Balancing, which decides where, which, and when to migrate {Pi : Cn → C3}.]
Cluster Should Perform Something Akin To Healing
Hardware and software faults can occur
Failures lead to the termination of computations
To provide something akin to healing
– Faults are identified and reported
– Checkpointing of parallel computation elements of applications is provided
– Recovery from failures is employed
– Migrating applications from faulty computers to healthy computers is carried out automatically
– Redundant/replicated services are provided
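The checkpoint-and-recover idea can be shown with a single-process Python sketch. It is deliberately much simpler than coordinated checkpointing of a parallel application: one computation saves its state to disk after every step and resumes from the last saved state after a simulated fault. The function names and the `crash_at` parameter are hypothetical.

```python
import os
import pickle
import tempfile


def checkpoint(state, path):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename over the previous checkpoint


def recover(path, initial):
    """Return the last checkpointed state, or `initial` if none exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return initial


def run(path, steps, crash_at=None):
    """Sum 0..steps-1, checkpointing after each step.

    If crash_at is given, a fault is simulated before that step runs.
    """
    state = recover(path, {"i": 0, "total": 0})
    while state["i"] < steps:
        if crash_at is not None and state["i"] == crash_at:
            raise RuntimeError("simulated fault")
        state["total"] += state["i"]
        state["i"] += 1
        checkpoint(state, path)
    return state["total"]
```

If the checkpoint file were on storage reachable from another computer, calling `run` there with the same path would continue the computation, which is the essence of recovering on a healthy computer.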
Self-Healing Service Design
[Figure: coordinated checkpointing of Computation Element i running on computer C1. Copies of the checkpoint for Computation Element i are kept on disk and on other computers (C2, Cj). Recovery restarts Computation Element i on computer Ck after a crash.]
Clusters Should Provide Self-Protection
Computation elements of parallel applications are distributed
Computation elements communicate using messages
They are subject to passive and active attacks
To provide self-protection:
– Virus detection and recovery must be offered
– Resource protection should be a mandatory service
– Encryption, as a countermeasure against passive attacks, should be used
– Authentication, as a countermeasure against active attacks, should be used
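Message authentication, the countermeasure against active attacks, can be sketched with Python's standard `hmac` module. The example assumes that sender and receiver already share a secret key; `seal` and `open_sealed` are hypothetical names, and this is not the mechanism Holos uses. Encryption against passive attacks is not shown because the Python standard library does not include a cipher for it.

```python
import hashlib
import hmac

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def seal(key: bytes, message: bytes) -> bytes:
    """Prefix the message with an HMAC-SHA256 tag computed under key."""
    tag = hmac.new(key, message, hashlib.sha256).digest()
    return tag + message


def open_sealed(key: bytes, packet: bytes) -> bytes:
    """Verify the tag and return the message; raise ValueError if altered."""
    tag, message = packet[:TAG_LEN], packet[TAG_LEN:]
    expected = hmac.new(key, message, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return message
```

A receiver that calls `open_sealed` on every incoming message rejects anything modified or forged by a party without the key. The sketch does not protect against replay of old messages.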
To Allow a System to Know Its Surrounding Environment and to Prevent a System From Existing in a Hermetic Environment
There are applications that require
– More computation power
– Specialized software
– Unique peripheral devices, etc.
Many owners cannot afford such resources
Some owners can offer their services and resources to appropriate users
To Allow a System to Know Its Surrounding Environment and to Prevent a System From Existing in a Hermetic Environment
To benefit from existing unique resources
– Resource discovery of other clusters is provided
– Advertising of services is in place
– Systems are able to cooperate
– Negotiation is in use
– Brokerage of resources and services is used
– Resources are shared in a distributed manner
– “The move toward a grid” should be in place
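Advertising, withdrawing, and importing services can be pictured with a small in-memory registry. The `Broker` class and its method names are hypothetical and only mirror the vocabulary used here (exporting, withdrawal, import requests); a real brokerage service would be distributed across clusters and would also handle negotiation.

```python
class Broker:
    """In-memory registry of services exported by clusters."""

    def __init__(self):
        self._exports = {}  # service name -> set of exporting clusters

    def export(self, cluster, service):
        """Advertise that `cluster` offers `service`."""
        self._exports.setdefault(service, set()).add(cluster)

    def withdraw(self, cluster, service):
        """Stop advertising `service` for `cluster`; unknown entries are ignored."""
        providers = self._exports.get(service, set())
        providers.discard(cluster)
        if not providers:
            self._exports.pop(service, None)

    def import_request(self, requester, service):
        """Return the sorted clusters, other than the requester, offering `service`."""
        return sorted(self._exports.get(service, set()) - {requester})
```

For example, after clusters 2 and 3 export a printer service, an import request from cluster 1 returns both, and after cluster 2 withdraws it returns cluster 3 only.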
Grid-like Service Design
[Figure: Cluster 1, Cluster 2, Cluster 3, ..., Cluster n each run Brokerage Services. The Brokerage Services handle advertisement, exporting services, withdrawal services, and import requests for computational services, storage/memory services, printer services, and information services.]
A Cluster Should Anticipate the Optimized Resources Needed While Keeping Its Complexity Hidden
The scarcity of software to assist ordinary programmers limits the harnessing of the computing power of non-dedicated clusters
This implies
– A programming environment simple to use
– Knowledge of resource distribution not needed
– Message passing and shared memory programming supported transparently
Easy Programming Service Design
[Figure: layered design. The Programming Environment offers Shared Memory (DSM) and Message Passing (or PVM / MPI), built on Communication Primitives, System Services of an Operating System, and Kernel Services of an Operating System.]
The Holos Services for Autonomic Computing Clusters
Holos is built to demonstrate that it is possible to develop an autonomic non-dedicated cluster that
– could be routinely employed by ordinary engineers, managers, etc.
– is able to support next generation application software executing on clusters
We followed the recommendations of IBM’s vision regarding autonomic elements
We decided to view autonomic elements as processes
– Each computer is a multi-process system with its objectives
– A cluster is a set of multi-process systems with its objectives
Holos
[Figure: Holos architecture. Components shown: Parallel Processes (MP / PVM / MPI Process, DSM Process); System Servers (Global Scheduler, Execution Server, Migration Server, Checkpoint Server, Resource Discovery Server, DSM Server, Brokerage Server); Kernel Servers (IPC Server, Process Manage Server, Space Manage Server); GENESIS Microkernel.]
Holos was developed based on the P2P and microkernel paradigms
The microkernel provides services such as
– local IPC
– basic paging operations
– interrupt handling
– context switching
Three groups of processes:
– kernel servers
– system servers
– application processes
Kernel and system servers are stationary, application processes are mobile
All processes communicate using messages
System Servers Form a Basis of an Autonomic Operating System for Nondedicated Clusters
Resource Discovery Server - collects data about computation and communication load
Availability Server - dynamically and adaptively forms a parallel virtual cluster for the application
Global Scheduling Server - maps application processes using static allocation and dynamic load balancing on the computers of the virtual parallel cluster
System Servers Form a Basis of an Autonomic Operating System for Nondedicated Clusters
Execution Server - coordinates the single, multiple and group creation and duplication of application processes on both local and remote computers
Migration Server - coordinates moving application processes to other computers
DSM Server - hides the distributed nature of the cluster’s memory and allows writing code as though using physically shared memory
System Servers Form a Basis of an Autonomic Operating System for Nondedicated Clusters
Checkpoint Server - coordinates creation of checkpoints for an executing application
Fault Recovery Server - recovers application processes / applications using checkpoints
IPC Server - supports remote interprocess communication and supports group communication within sets of application processes
Brokerage Server - supports advertising and sharing services through service exporting, importing and revoking
Holos Possesses the Autonomic Computing Characteristics
Autonomic Computing Requirement | Cooperating Holos Servers – Relationships Among Autonomic Elements
To allow a system to know itself | Resource Discovery Server
A system must configure and reconfigure itself under varying and unpredictable conditions | Resource Discovery Server, Global Scheduling Server, Migration Server, Execution Server, and Availability Server
A system must optimize its working | Global Scheduling Server, Migration Server, and Execution Server
A system must perform something akin to healing | Checkpoint Server, Recovery Server, Migration Server, Global Scheduling Server
A system must provide self-protection | Capabilities in the form of System Names
A system must know its surrounding environment | Resource Discovery Server, and Brokerage Server
A system cannot exist in a hermetic environment | Interprocess Communication Server, and Brokerage Server
A system must anticipate the optimized resources needed while keeping its complexity hidden (most critical for the user) | DSM Server, and Execution Server, DSM Programming Environment, Message Passing Programming Environment, PVM/MPI Programming Environment
Conclusion
Autonomic computing has been shown to be a basic part of a revolutionary technology that
– Could move parallel computing on non-dedicated clusters to the computing mainstream
– (Will start the new .com boom – is to be shown)
The development of the Holos cluster operating system demonstrates that it is possible to build an autonomic non-dedicated cluster
The Holos cluster operating system has been built from scratch