
Processor failure recovery for a resource sharing algorithm

I.A. Newman, B.Sc., Ph.D., F.B.C.S., R.P. Stallard, M.A., A.M.B.C.S., and M.C. Woodward, B.Sc., Ph.D., A.M.B.C.S.

Indexing terms: Algorithms, Digital computers and computation, Reliability

Abstract: With the increase in popularity of distributed computer systems, the reliability of the system as a whole is becoming more important. A recently published combined resource sharing algorithm showed how the atomic operations required for resource management in a closely coupled multiprocessor system could be provided. The paper describes a recovery system that may be incorporated within the earlier algorithm to enable continued and correct operation of the system despite the failure of one or more component processors. A distributed simulation of the recovery mechanism is described and results from simulation runs are presented.

1 Introduction

In most systems a resource (a peripheral, a file or a data structure) is attached to a single processor which is responsible for managing the access to that resource. In the special case of a closely coupled system in which there is a shared medium accessible by several processors, it is possible to have passive resources (usually data structures) which can be used by processes running on different processors. Under these circumstances it is essential that each passive resource is controlled by a resource management scheme which can ensure that a process can obtain exclusive access to the resource. Giving one process exclusive access ensures that there can be no corruption of the resource if all processes complete their tasks correctly. However, if the processor on which the process is running fails while that process is accessing a resource, no other process can gain access to the resource and the whole system could deadlock.

This paper discusses the extensions to an existing resource management scheme which will permit a resource to be reclaimed after a failure has occurred. The management scheme provides the necessary coordination of access for resources which may be directly accessed by several processors. Proposals have been presented for higher level structuring methodologies giving reliability, but these depend in turn on more primitive coordination (for example, the mutex data type of the Argus language [12]) which would typically be provided by the scheme under discussion. Techniques that may be used to provide reliability in the context of a loosely coupled system, where each resource is directly accessible by only one processor, may be found in the literature (for example, Reference 4).

Two previous reports described a resource master allocation algorithm [9] and an algorithm for the recovery of processor failures during resource claim and release [14]. This paper considers a recovery scheme for an improved resource allocation algorithm [8], and presents results for a simulation model implemented on an actual multiple processor system.

The basic system synchronisation primitives that enable passive resources to be safely shared between processors are termed get (the routine to gain access to the resource) and put (the routine to release access). These routines are analogous in a multiple processor system to the P and V operations of Dijkstra for concurrent process control on a single processor.

Paper 4426E (C1, C2), first received 3rd April and in revised form 11th September 1985

The authors are with the Department of Computer Studies, University of Technology, Loughborough, Leicestershire, LE11 3TU, United Kingdom

IEE PROCEEDINGS, Vol. 133, Pt. E, No. 2, MARCH 1986

The recovery algorithm is designed to cope with failures during the processes of claiming and releasing a resource under the control of the allocation algorithm. A method of safe update [13] or checkpointing [10] could be used to recover failures during the actual use of the resource. This recovery algorithm can, however, detect if the processor could have been using the resource and therefore activate an appropriate recovery procedure.

The resource allocation algorithm under consideration is essentially a union of two distinct approaches to resource allocation. One method, the resource master type, ensures that there is always a processor that owns the resource. Ownership of the resource is only passed when another processor indicates that it wishes to use the resource. The other method, termed bartering [3, 6], leaves the resource in an unowned state when a processor has finished using it. In the latter scheme a processor becomes the owner of a resource by bidding against all other requesters, taking ownership only when it has been established that no other processor has a prior claim to the resource. The resource master method performs well when there is high demand for the resource (there is normally a processor to pass ownership to), and the bartering method when there is low demand (there is no other processor to bid against). The new allocation algorithm uses both methods of ownership passing, choosing which method to use according to the number of requesting processors when the resource is released.
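The release-time choice between the two methods can be illustrated with a short Python sketch. It is not the authors' code: the function name `choose_next_owner`, the string status values and the use of list indices as processor numbers are assumptions made for the example.

```python
from typing import List, Optional

def choose_next_owner(status: List[str], me: int) -> Optional[int]:
    """Return the processor that should receive ownership when `me` releases.

    Scans the ring once, starting at the successor of `me`. The first
    processor whose status is not "dontwant" receives the resource
    (resource master behaviour). If nobody is requesting, None is returned
    and the resource is left unowned, so the next requester has to barter.
    """
    n = len(status)
    test = (me + 1) % n
    while test != me:
        if status[test] != "dontwant":
            return test
        test = (test + 1) % n
    return None

# High demand: processor 2 is waiting, so ownership is passed directly.
busy = choose_next_owner(["dontwant", "dontwant", "want", "needing"], me=0)
# Low demand: nobody is waiting, so the resource is left unowned.
idle = choose_next_owner(["dontwant"] * 4, me=0)
print(busy, idle)
```

The scan order is what gives the round robin behaviour discussed in the next paragraph: the requester closest after the releasing processor in the ring is always served first.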

Any allocation algorithm must provide a fair service for all participating processors, and must never lead to a processor being starved of access to the resource. Ideally a 'first come, first served' service should be provided, but a 'round robin' service will normally be adequate. The combined resource allocation algorithm provides the latter form of service as it is based on the abstract resource ring algorithm (a member of the resource master class). This uses a fixed cyclic ordering termed the ring to provide round robin service. A ring has the additional advantage that each processor has a unique successor and predecessor; this will prove useful in the recovery algorithm described in later sections. The ordering in the ring could be inferred from private physical connections but could also be defined by a data structure in a shared medium.
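As a minimal sketch of a ring defined by a data structure, the following Python fragment builds a successor and predecessor table from a fixed cyclic ordering. The function name `build_ring` and the processor labels are illustrative assumptions, not part of the paper.

```python
from typing import Dict, List

def build_ring(order: List[str]) -> Dict[str, Dict[str, str]]:
    """Give every processor a unique successor and predecessor."""
    n = len(order)
    return {
        p: {"next": order[(i + 1) % n], "prev": order[(i - 1) % n]}
        for i, p in enumerate(order)
    }

ring = build_ring(["A", "B", "C", "D"])
print(ring["A"]["next"], ring["A"]["prev"])
```

In a real closely coupled system this table would live in the shared medium so that every processor sees the same ordering.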

2 Processor failure

An essential part of safe processor recovery after a failure is the ability to detect the failure. One method of detection is for a separate hardware model (sometimes called an arbiter) to monitor all the processors and unilaterally decide if a processor has failed. Another method allows each processor to monitor all the other processors, a processor being designated as failed if the majority of the processors agree as to the failure [11, 5].

There are a variety of ways that a malfunction can be detected. One method insists that a processor must respond to a message within a certain time limit to be deemed to function normally; another permits processors to inspect each other's clock driven timing variables, in which case failure is assumed if the variables are not updated correctly. A third method requires each processor to send all other processors a message at fixed time intervals, assuming a processor to have failed if its messages are not received.
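The periodic-message method and the majority decision can be combined in a small Python sketch. The paper assumes no particular detection mechanism, so everything here (the function names `suspects` and `majority_agrees`, the timestamps and the timeout) is a hypothetical illustration.

```python
from typing import Dict, List

def suspects(last_heard: Dict[str, float], now: float, timeout: float) -> List[str]:
    """Processors whose most recent message is older than `timeout`."""
    return sorted(p for p, t in last_heard.items() if now - t > timeout)

def majority_agrees(votes: Dict[str, bool]) -> bool:
    """True if more than half of the observers report the suspect as failed."""
    return sum(votes.values()) > len(votes) / 2

# Observer A's view: the time at which it last heard from each other processor.
heard_by_a = {"B": 9.8, "C": 4.0, "D": 9.5}
a_suspects = suspects(heard_by_a, now=10.0, timeout=2.0)

# Votes of the three live observers on whether C has failed.
c_failed = majority_agrees({"A": True, "B": True, "D": False})
print(a_suspects, c_failed)
```

A slow but healthy processor would also be reported by `suspects`, which is one way a failure condition can be raised erroneously.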

All of these methods can cause a failure condition to be erroneously raised. Hardware arbiters have the drawback of introducing new possible faults if the arbiter itself malfunctions. A voting system relies on a safe method of counting votes. The principle used for designing a fault tolerant system must be to choose methods that have the minimum risk of error. The inherent problem is that unless every action taken by a processor is checked (at the instruction level) some categories of fault will not be detected. Even then there will always be the risk that the fault detection system will itself malfunction. No particular mechanism for fault detection is assumed for the purposes of this paper.

There are also limits as to which faults are recoverable. Generally, the more fault tolerant a system is, the slower the software will run and the more expensive the hardware will be. This recovery algorithm is designed to cope with the failure of one or more processors, but only when certain conditions apply.

(a) Once a processor is deemed to have failed, that processor must not perform any modifications that affect other processors (for example, changes to shared memory or shared peripherals). This type of failure covers the common power failure type of fault but not faults that are intermittent or destructive.

(b) The second condition is that the remaining processors can still communicate via the shared medium; failure of the medium leads to partitioning into one-member rings, and thus loss of the distributed system and its sharing of resources.

(c) Thirdly, it is assumed that the assignment and conditional statements in the hybrid algorithm are point operations (indivisible); if this is not the case, then some form of safe update [5] is required to execute these statements.

(d) Finally, the data structures representing the state of the resource ownership must remain uncorrupt. Should a processor fail by randomly writing to memory such that the data structures are modified, correct access to the protected resources cannot be guaranteed, even though a correct state may be reached in the future.

3 Hybrid resource allocation algorithm

An earlier paper [8] described the allocation algorithm in some detail; this section describes the changes made to the original implementation to simplify the process of recovery. The required modifications are, for consistency, presented in the notation used in the earlier paper, that is in Pascal.

     1) const
     2)   unowned = (* special processor number *);
     3) type
     4)   processor = (* some specification of all processor values including
     5)                unowned, with each processor having a different
     6)                identification *);
     7) var
     8)   status : shared array [processor] of                      **
     9)     (dontwant, needing, claiming, want);                    **
    10)     (* initially dontwant *)                                **
    11)                                                             **
    12)   claimer : shared boolean;
    13)     (* initially false *)
    14)   putting : shared processor;                               **
    15)     (* initially unowned *)                                 **
    16)   last : shared processor;
    17)     (* initially an arbitrary processor *)
    18)   owner : shared processor;
    19)     (* initially unowned *)
    20)   me : (* local *) processor;
    21)     (* set to my processor number *)
    22)   test : (* local *) processor;
    23) function next (p : processor): processor;
    24) begin
    25)   (* return the processor number of the next processor in the ring *)
    26) end;
    27) procedure barter;
    28) begin
    29)   claimer := true;
    30)   test := next (last);
    31)   while test <> me do
    32)   begin
    33)     while status [test] = claiming do;                      **
    34)     test := next (test)
    35)   end;
    36)   if owner = unowned then
    37)     owner := me;
    38)   claimer := false
    39) end;
    40) procedure wait for ownership;
    41) begin
    42)   (* this procedure may be implemented by returning control to the
    43)      system scheduler as is usual under uniprocessor resource
    44)      allocation *)
    45)   while owner <> me do
    46) end;
    47) procedure wait for put complete;
    48) begin
    49)   while putting <> unowned do
    50) end;
    51) procedure find claimer or release;
    52) begin
    53)   test := next (me);
    54)   while test <> me do
    55)   begin
    56)     if status [test] <> dontwant then
    57)     begin
    58)       owner := test;
    59)       test := me
    60)     end
    61)     else
    62)       test := next (test)
    63)   end;
    64)   if owner = me then
    65)     owner := unowned;
    66) end;
    67) procedure put;
    68) begin
    69)   wait for put complete;
    70)   putting := me;
    71)   last := me;
    72)   status [me] := dontwant;
    73)   find claimer or release;
    74)   putting := unowned
    75) end;
    76) procedure get;
    77) begin
    78)   status [me] := needing;
    79)   wait for put complete;
    80)   status [me] := claiming;
    81)   if (owner = unowned) and (not claimer) then
    82)     barter;
    83)   status [me] := want;
    84)   wait for ownership
    85) end;

Fig. 1 Hybrid resource allocation algorithm

In the original algorithm, a complex set of interlocks is necessary to control switching between modes of operation and to allow arbitrarily long time intervals between each execution step. To implement these interlocks a number of control flags are used, and therefore the resource can be in a large number of possible states when a failure occurs.

The implementation of the algorithm used for the recovery is shown in Fig. 1, the line numbers used being the same as in the previous paper [8]. Lines that have been modified are indicated by two stars '**' in the right-hand column.

Two changes have been made to the original algorithm. They enable the point of failure to be established with greater confidence and thus simplify the tests required to decide which actions are needed to re-establish the integrity of the resource. They do not affect the structure or the performance of the algorithm.

The first amendment is that the want and claim boolean flags are replaced by a single status variable which can take the values dontwant, needing, claiming and want. This enables the state want = true and claim = false to be split into two different status values depending on whether the processor has yet to decide to enter the bartering phase (status needing), or else has finished bartering (status want). The state of want = false and claim = true is not possible in the previous algorithm. The correspondence between the old and new states is as follows:

| Previous algorithm [8]: Want | Previous algorithm [8]: Claim | Recovery algorithm: Status |
|---|---|---|
| false | false | dontwant |
| true | false | needing (before barter, lines 79 or 80 of get) |
| true | false | want (after barter, line 84 to line 72) |
| true | true | claiming |
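The correspondence can also be written as a small Python function. This is only a sketch: the names `Status`, `to_status` and the `past_barter` parameter (true once the processor has passed the bartering decision) are assumptions introduced for the example.

```python
from enum import Enum

class Status(Enum):
    DONTWANT = "dontwant"
    NEEDING = "needing"
    CLAIMING = "claiming"
    WANT = "want"

def to_status(want: bool, claim: bool, past_barter: bool = False) -> Status:
    """Map the old (want, claim) flag pair onto the single status value."""
    if want and claim:
        return Status.CLAIMING
    if want:
        # The old state want = true, claim = false is split in two.
        return Status.WANT if past_barter else Status.NEEDING
    if claim:
        raise ValueError("want = false with claim = true cannot occur")
    return Status.DONTWANT

states = [to_status(False, False), to_status(True, False),
          to_status(True, False, past_barter=True), to_status(True, True)]
print([s.value for s in states])
```

The extra `past_barter` argument makes explicit the information that the old flag pair could not express, which is why the recovery algorithm needs the four-valued status.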

The other change is that the boolean putting flag now becomes a variable which contains either the identifier of the processor performing the put or the value unowned. This enables the recovery algorithm to discover which processor is performing a put.

4 Recovery algorithm

The algorithm was written with the intention of handling all single processor failures and most multiple processor failures, subject to the constraints in Section 2. To achieve this aim, the method first diagnoses, as accurately as possible, where the failure occurred and then changes the ring data, step by step, performing the same operations as the failed processor would have made, until the dontwant and unowned state is reached. This method ensures that no inconsistent states are ever reached and that if the recoverer itself fails then the recovery can be resumed from that point by another processor.

Because the recovery algorithm must recover a failure within the routines that provide locks between processors, the algorithm cannot itself lock out other processors while the recovery is taking place. Thus the recoverer process cannot assume that shared control data will remain static.

The most severe repercussion of the inability to freeze the data structure is that while a processor is in the midst of recovery, the processor that has failed may be given ownership of the resource.

To demonstrate the correctness of the algorithm, all the possible resource states are presented with the appropriate recovery action. The algorithm is a straightforward synthesis of these actions.

The primary subdivision of states is on the basis of the current resource ownership: it can be owned by the failed processor, owned by another processor or unowned. These are further subdivided according to the status of the failed processor (namely: dontwant, needing, claiming, want).

Tables 1, 2 and 3 list these states, the situation and the necessary actions to recover from the failure. It is assumed that the status of the failed processor cannot change after the failure nor, if it owns the resource, can the ownership. The ownership can, however, change between being unowned and some processor owning, or between other different processors owning during the recovery.

The synthesis of all these actions is shown in a possible implementation of the recovery algorithm (Fig. 2). Although the broad approach is the same as that outlined above, certain short cuts have been made. The first of these avoids the call to the wait for ownership procedure (Fig. 1 lines 40-46). The recovery should be as fast as possible and so this state is bypassed by changing status to dontwant, missing out the want state. This, however, introduces the possibility of being given the resource while in a dontwant state, as the processor which currently owns the resource may have noted the request for ownership and decided to pass the resource to the failed processor before the flag was cleared to dontwant. The recovery is not, therefore, complete until it is certain that ownership can no longer be given to the failed processor. If no put operation is in progress, then this is ensured automatically. If, however, a put is taking place, then the recoverer must wait for the operation to be completed. Fortunately these tests can be implemented as a single while loop testing the putting flag. If after this loop the failed processor is not the owner, it is no longer possible for it to become the owner.
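For experimentation, the recovery procedure of Fig. 2 can be transliterated into Python roughly as follows. This is a single-threaded sketch under stated assumptions: the class name `Ring`, the sentinel `UNOWNED = -1`, integer processor numbers and string status values are all choices made for the example, and the busy-wait loops only terminate if no other processor is in the middle of a put or a barter.

```python
UNOWNED = -1  # assumed sentinel standing for the paper's special value "unowned"

class Ring:
    def __init__(self, n: int) -> None:
        self.n = n
        self.status = ["dontwant"] * n
        self.claimer = False
        self.putting = UNOWNED
        self.last = 0
        self.owner = UNOWNED

    def next(self, p: int) -> int:
        return (p + 1) % self.n

    def wait_for_put_complete(self) -> None:
        while self.putting != UNOWNED:
            pass  # spins until a live processor finishes its put

    def put(self, it: int) -> None:
        """Normal put, with 'it' replacing every occurrence of 'me'."""
        self.wait_for_put_complete()
        self.putting = it
        self.last = it
        self.status[it] = "dontwant"
        test = self.next(it)
        while test != it:
            if self.status[test] != "dontwant":
                self.owner = test
                test = it
            else:
                test = self.next(test)
        if self.owner == it:
            self.owner = UNOWNED
        self.putting = UNOWNED

    def recover(self, it: int) -> None:
        s = self.status
        if self.owner == it:  # the failed processor owns the resource
            if s[it] in ("dontwant", "want"):
                if self.putting == it:
                    self.putting = UNOWNED
            elif s[it] == "claiming":
                self.claimer = False
                s[it] = "want"
            elif s[it] == "needing":
                s[it] = "want"
            self.put(it)
        else:  # resource unowned or owned by someone else
            if self.owner == UNOWNED and s[it] == "claiming":
                test = self.next(self.last)  # barter on behalf of 'it'
                while test != it:
                    while s[test] == "claiming":
                        pass
                    test = self.next(test)
                if self.owner == UNOWNED:
                    self.owner = it
                self.claimer = False
            elif s[it] == "dontwant" and self.putting == it:
                self.putting = UNOWNED
            elif s[it] == "claiming":
                self.claimer = False
            s[it] = "dontwant"  # short cut: the want state is skipped
            self.wait_for_put_complete()
            if self.owner == it:
                self.put(it)

# Processor 1 failed while using the resource; processor 2 is waiting for it.
r = Ring(3)
r.status = ["dontwant", "want", "want"]
r.owner = 1
r.recover(1)

# Processor 1 failed in the middle of a barter for an unowned resource.
r2 = Ring(3)
r2.status[1] = "claiming"
r2.claimer = True
r2.recover(1)
```

In the first scenario the recoverer completes the put on behalf of the failed processor and ownership moves to the waiting processor; in the second the barter is finished and the resource is then released again.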

There must also be a mechanism for deciding which of the processors should perform the recovery, since only one processor may recover a given failed processor. The ring structure itself ensures each processor has a unique successor and so a reasonable method is that the predecessor of a failed processor in the ring should perform the recovery. This has the advantage that a processor need only detect a failure in its immediate successor rather than for all processors in the system.

Table 1: The failed processor owns the resource

| Status flag | Situation | Line number | Action |
|---|---|---|---|
| dontwant | In put, looking for a new owner | 73, 51-66 | Set putting to unowned; perform put |
| needing | In get, have been given resource | 79, 47-50 | Set status to want; perform put |
| claiming | In get, have become owner | 81-82, 27-39 | Reset claimer to false; set status to want; perform put |
| want | In get, or using resource, or started put | 84, 40-46, 67-72, 47-50 or using | If it was putting then clear putting; perform put |

Table 2: Resource is unowned (at present)

| Status flag | Situation | Line number | Action |
|---|---|---|---|
| dontwant | In put or not using resource | 65-66, 73-75, or not using | If this processor was putting then set putting to unowned; wait until putting = unowned (1); if owner perform put |
| needing | In get, started claim | 79, 47-50 | Set status to dontwant; wait until putting = unowned; if owner perform put |
| claiming | In get, performing barter | 81-82, 27-39 | Perform barter lines 45-61; set status to dontwant; wait until putting = unowned; if owner perform put |
| want | In get, waiting to be given ownership (another processor has a prior claim) | 84, 40-46 | Set status to dontwant; wait until putting = unowned; if owner perform put |

(1) The last two lines for the case status = dontwant are only required for the recovery of multiple failures

Table 3: Resource owned by another processor (at present)

| Status | Situation | Line number | Action |
|---|---|---|---|
| dontwant | In put having set a new owner or not using resource | 59-66, 73-75, or not using | If this processor was putting then set putting to unowned; wait until putting = unowned (1); if owner perform put |
| needing and want | Identical situation and action as for Table 2 (resource unowned) | | |
| claiming | In get performing barter, cannot be processor that should gain ownership | 81-82, 27-39 | Reset claimer to false; set status to dontwant; wait until putting = unowned; if owner perform put |

(1) The last two lines for the case status = dontwant are only required for the recovery of multiple failures

     1) procedure recover (it : processor);
     2) var test : processor;
     3) begin
     4)   if owner = it then {the case of the failed processor owning}
     5)   begin
     6)     if (status [it] = dontwant) or
     7)        (status [it] = want) then
     8)     begin
     9)       if putting = it then putting := unowned
    10)     end
    11)     else if status [it] = claiming then
    12)     begin
    13)       claimer := false;
    14)       status [it] := want
    15)     end
    16)     else if status [it] = needing then
    17)     begin
    18)       status [it] := want
    19)     end;
    20)     put (it) {this is a normal put operation except 'it' replaces}
    21)   end        {all occurrences of 'me' in the procedure}
    22)   else
    23)   begin {resource unowned or owned by someone else}
    24)     if (owner = unowned) and (status [it] = claiming) then
    25)     begin
    26)       test := next (last); {perform normal get barter on behalf}
    27)                            {of the failed processor}
    28)       while test <> it do
    29)       begin
    30)         while status [test] = claiming do;
    31)         test := next (test)
    32)       end;
    33)       if owner = unowned then
    34)         owner := it;
    35)       claimer := false
    36)     end
    37)     else if (status [it] = dontwant) and (putting = it) then
    38)       putting := unowned
    39)     else if status [it] = claiming then
    40)       claimer := false;
    41)     status [it] := dontwant;
    42)     while putting <> unowned do;
    43)     if owner = it then put (it)
    44)   end
    45) end;

Fig. 2 Recovery algorithm

5 Multiple failures

The above description assumes only one failure is being recovered at any one time. It would be desirable for several failures in the system to be recovered at the same time.

If the failures are not adjacent in the ring, then each recovery can be run simultaneously and there is no difference to the case of a single failure, because the recoverer performs operations identical to a normal running processor.

If there are two adjacent failures in the ring, e.g., in processors C and D in Fig. 3, then one processor must try to recover both of them simultaneously because the recovery of one of them may rely on the recovery of the other. This can be implemented by starting a recovery task for each adjacent failed processor found when checking around the ring. In this example processor B would start a recovery task for both C and D.

Fig. 3 An example processor ordering

The more difficult case is when a processor fails while performing a recovery (or multiple recoveries). The above algorithm will start up a normal recovery task for the recoverer processor and those it was recovering. In the example ring, a failure in B when it is recovering C and D would lead to processor A running a recovery task for B, C and D.
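The rule that a live processor recovers the whole run of failed processors directly after it in the ring can be sketched in Python as below. The function name `recovery_tasks` is hypothetical, and the four-processor ordering A, B, C, D is assumed only because it is consistent with the examples in the text; the actual ordering of Fig. 3 is not reproduced here.

```python
from typing import Dict, List, Set

def recovery_tasks(ring: List[str], failed: Set[str]) -> Dict[str, List[str]]:
    """Map each live processor to the adjacent failed processors after it."""
    tasks: Dict[str, List[str]] = {}
    n = len(ring)
    for i, p in enumerate(ring):
        if p in failed:
            continue
        run: List[str] = []
        j = (i + 1) % n
        # Walk forward until a live processor is met (p itself at the latest).
        while ring[j] in failed:
            run.append(ring[j])
            j = (j + 1) % n
        if run:
            tasks[p] = run
    return tasks

order = ["A", "B", "C", "D"]
adjacent = recovery_tasks(order, {"C", "D"})
recoverer_failed = recovery_tasks(order, {"B", "C", "D"})
separate = recovery_tasks(order, {"B", "D"})
print(adjacent, recoverer_failed, separate)
```

Each entry in the result corresponds to one set of recovery tasks started by a single processor; non-adjacent failures produce independent entries that can run simultaneously.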

The only additional hazard in restarting a recovery is that the short cuts which were made in the algorithm may produce states that would not normally occur.

There is only one such possible state for this particular recovery algorithm and that is if the failed processor is set to the dontwant state and then the recoverer process itself fails. The problem is that even though the original failed processor is in a dontwant state and is not the owner, the processor may still be given ownership of the resource if any of the remaining processors is in the process of releasing the resource (i.e. within put). There must, therefore, be a check that a processor in the dontwant state is not about to be given ownership. The recoverer must always wait for the current put to complete (if one is in progress) and then check that ownership has not been given to it whatever the status of the failed processor. The single failure recovery algorithm already carries out these checks for all status values other than dontwant, so that the only modification required is for the recoverer to carry out the test for all possible status values whenever the failed processor is not initially the owner of the resource.

6 Ring size

The description of the hybrid algorithm [8] uses an abstract resource ring of fixed size. The size of the ring can vary dynamically if the function next computes the next processor in the ring from dynamic shared data defining the ring topology.

An add processor routine would insert the new processor into the ring with status set to dontwant. The remove processor routine would reroute the ring around that processor, but this can only be done if the processor is not claiming or using the resource, i.e. status is dontwant, not putting and not owner. Changing the ring has no additional effect on fault tolerance as long as the ring modifications are indivisible operations.
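A ring whose membership can change might look like the following Python sketch. The class `DynamicRing` and its method names are assumptions for illustration, and the sketch does not itself make the modifications indivisible; in a shared-memory system each change would have to be applied as a single atomic update.

```python
from typing import Dict, List, Optional

class DynamicRing:
    def __init__(self, members: List[str]) -> None:
        self.members = list(members)
        self.status: Dict[str, str] = {p: "dontwant" for p in self.members}
        self.owner: Optional[str] = None    # None stands for unowned
        self.putting: Optional[str] = None  # None stands for unowned

    def next(self, p: str) -> str:
        """Compute the successor from the current ring topology."""
        i = self.members.index(p)
        return self.members[(i + 1) % len(self.members)]

    def add_processor(self, p: str, after: str) -> None:
        """Insert p directly after an existing member, with status dontwant."""
        self.members.insert(self.members.index(after) + 1, p)
        self.status[p] = "dontwant"

    def remove_processor(self, p: str) -> bool:
        """Reroute the ring around p, but only if p is idle."""
        idle = (self.status[p] == "dontwant"
                and self.putting != p and self.owner != p)
        if idle:
            self.members.remove(p)
            del self.status[p]
        return idle

ring = DynamicRing(["A", "B", "C"])
ring.add_processor("X", after="A")
ring.owner = "B"
refused = ring.remove_processor("B")   # B owns the resource
removed = ring.remove_processor("C")   # C is idle
print(ring.members, refused, removed)
```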

Normally, the majority of processors in the system will be running; only occasionally will any of the processors be in a failed state. There is little overhead in leaving failed processors in the ring, as there need only be one redundant inspection of a flag in both the get and put operations. Both the allocation and the recovery algorithms described consider the ring to remain static in size.

7 Simulation

The Neptune four processor system was used to check, as far as possible, the correctness of the recovery algorithm. It was implemented in Fortran with the extra parallel constructs provided by a preprocessor [1].

The processors were programmed to claim, use and release a resource at random time intervals to simulate a normal usage pattern. At random times each processor made a choice independently of other processors and, if no recovery was already in progress, forced that processor to fail. The actual failure was simulated by means of a shared boolean array (one flag per processor). Before a processor was permitted to alter any of the resource ring shared data structure, its flag was tested and the update was only performed if the flag was not set. In this manner a processor was effectively halted at a random point, as far as the resource claim and release was concerned. There could, however, be one further update carried out after setting the flag; this problem was overcome by delaying the recovery process until the failed processor had acknowledged the failure.
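The flag-guarded update used to simulate a failure can be sketched in Python as follows. The original simulation was written in Fortran; the class `SimulatedProcessor`, the method `guarded_write` and the dictionary standing in for the shared ring data are hypothetical names for this illustration.

```python
from typing import Any, Dict, List

class SimulatedProcessor:
    def __init__(self, ident: int, failed: List[bool], shared: Dict[str, Any]) -> None:
        self.ident = ident
        self.failed = failed    # shared boolean array, one flag per processor
        self.shared = shared    # stands in for the shared ring data structure

    def guarded_write(self, key: str, value: Any) -> bool:
        """Update shared data only if this processor's failure flag is clear."""
        if self.failed[self.ident]:
            return False        # the processor is "halted" at this point
        self.shared[key] = value
        return True

shared = {"owner": None}
failed = [False, False]
p0 = SimulatedProcessor(0, failed, shared)

before = p0.guarded_write("owner", 0)    # accepted: flag not set
failed[0] = True                         # another processor forces the failure
after = p0.guarded_write("owner", None)  # rejected: update suppressed
print(before, after, shared["owner"])
```

Between the flag test and the write a real processor could still complete one update after its flag has been set, which is the window that the acknowledgement step closes.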

Once the recovery had been completed, the failed processor was allowed to restart executing resource claim and release cycles.

A test was inserted to report a fault condition if two or more processors were given access to the resource at the same time. If the processors had become deadlocked, the fault would have been detected by task time-out limits.
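A check of this kind might look like the following Python sketch. The class `ExclusionMonitor` and its methods are hypothetical names introduced for illustration; the lock only protects the monitor's own bookkeeping and is not part of the allocation algorithm. Deadlock detection by time-out is not shown.

```python
import threading


class ExclusionMonitor:
    """Records a fault whenever two or more processors hold the resource."""

    def __init__(self):
        self._inside = set()
        self._lock = threading.Lock()
        self.faults = []  # each entry lists the processors inside together

    def enter(self, p):
        """Call when processor p has been given access to the resource."""
        with self._lock:
            self._inside.add(p)
            if len(self._inside) >= 2:
                self.faults.append(sorted(self._inside))

    def leave(self, p):
        """Call when processor p releases the resource."""
        with self._lock:
            self._inside.discard(p)
```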

The simulation provides a way of snap-shotting the state of a processor during resource usage and thus provides a guide to the frequency of occurrence of the various resource states. The actual frequencies will depend on the level of demand and details of the implementation.
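One way to tally such snapshots is sketched below in Python. The grouping by putting, claimer, failed processor status and resource ownership follows the layout of the result tables; the function names `classify` and `record`, and the use of `None` for unowned, are assumptions made for the example.

```python
from collections import Counter

UNOWNED = None  # assumed marker meaning "no processor"


def classify(failed, status, owner, putting, claimer):
    """Map one snapshot to a (putting, claimer, status, ownership) cell."""
    if owner is UNOWNED:
        ownership = "unowned"
    elif owner == failed:
        ownership = "failed processor"
    else:
        ownership = "someone else"
    putting_label = "unowned" if putting is UNOWNED else "someone"
    return (putting_label, claimer, status[failed], ownership)


def record(tally, failed, status, owner, putting, claimer):
    """Add one snapshot, taken at the moment of failure, to the tally."""
    tally[classify(failed, status, owner, putting, claimer)] += 1


tally = Counter()
status = {0: "want", 1: "dontwant", 2: "needing", 3: "dontwant"}
record(tally, failed=1, status=status, owner=0, putting=UNOWNED, claimer=False)
record(tally, failed=0, status=status, owner=0, putting=UNOWNED, claimer=False)
```

Dividing each count by the total number of snapshots would give the relative frequency of each state.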

Table 4: Simulation results for high resource load

(1) putting = unowned and claimer = false
Failed processor status: dontwant, needing, claiming, want
Resource ownership, someone else: 74187292192 58160
Resource ownership, failed processor: 3839423 100916
Resource ownership, unowned: 21 326117 750*

(2) putting = someone and claimer = false
Failed processor status: dontwant, needing, claiming, want
Resource ownership, someone else: 2088490 2544
Resource ownership, failed processor: 2 21828 346
Resource ownership, unowned: 17126 0*0*

(3) putting = unowned and claimer = true
Failed processor status: dontwant, needing, claiming, want
Resource ownership, someone else: 4100 15
Resource ownership, failed processor: 00 1330
Resource ownership, unowned: 5932 1004126

(4) putting = someone and claimer = true
Failed processor status: dontwant, needing, claiming, want
Resource ownership, someone else: 0, 0, 0, 0
Resource ownership, failed processor: 0, 0, 0, 0
Resource ownership, unowned: 0*, 0*, 0*, 0*

Total recoveries: 264 339

Number of simultaneously failed processors    1       2       3
Percentage of recoveries                      19.5    48.3    32.2

IEE PROCEEDINGS, Vol. 133, Pt. E, No. 2, MARCH 1986 83


Two sets of results were obtained: one for a high loading of the resource (Table 4), where four processors spent one-third of their active processing within a critical code section controlled by a resource, and the other for low loading (Table 5), where the four processors spent one twenty-first of their time using the resource. These results show that under high load the resource was owned by a processor 91% of the time and that resource ownership was mostly passed by the resource master algorithm within the put routine, because the number of times with putting set is higher under high load than under low load. Under low load conditions the failed processor was in a dontwant state 92% of the time.

Table 5: Simulation results for low resource load

(1) putting = unowned and claimer = false
Failed processor status: dontwant, needing, claiming, want
Resource ownership, someone else: 67 006, 41, 23, 1 736
Resource ownership, failed processor: 1169126 15 759
Resource ownership, unowned: 147 90170700*

(2) putting = someone and claimer = false
Failed processor status: dontwant, needing, claiming, want
Resource ownership, someone else: 822, 63, 0, 11
Resource ownership, failed processor: 78200 98
Resource ownership, unowned: 13030*0*

(3) putting = unowned and claimer = true
Failed processor status: dontwant, needing, claiming, want
Resource ownership, someone else: 44, 0, 0, 3
Resource ownership, failed processor: 00 860
Resource ownership, unowned: 4490 42810

(4) putting = someone and claimer = true
Failed processor status: dontwant, needing, claiming, want
Resource ownership, someone else: 0, 0, 0, 0
Resource ownership, failed processor: 0, 0, 0, 0
Resource ownership, unowned: 0*, 0*, 0*, 0*

Total recoveries: 235 829

Number of simultaneously failed processors    1       2       3
Percentage of recoveries                      12.5    37.0    50.5

The results show that all the states the simulator found the failed processor to be in were recoverable. The relative proportion in each state gives an indication of the time spent in that state.

From the logic of the allocation algorithm it can be shown that some of the possible combinations of control variable values should never arise; these states are indicated by '*' in the table. The reasoning behind these assertions is given in Appendix 11.1. The simulation did not find a processor in any of these states, but did produce several of the multiple failure hazards that required the modification discussed in Section 5.

8 Conclusions

The reliability of computer systems is becoming of greater importance as more applications are designed. The use of closely coupled multiprocessor systems provides one way of improving this reliability. However, new requirements for safe access to shared resources are introduced. The above discussion has shown how an existing co-ordination algorithm may be modified to provide reliable resource sharing in a multiprocessor environment despite the failure of one or more of the constituent processors. The modified algorithm could provide the primitive operations required for a high level resource management scheme on a distributed system.

Although the authors are confident that the algorithms are correct (and extensive testing of all states confirms this assertion), a more formal approach using, say, the calculus of communicating systems [7] or a communicating sequential process model [2] may provide a framework for formal proof.

The basic resource sharing algorithm is currently being used to control access to the shared resources within the Neptune parallel processing system within the Department of Computer Studies at Loughborough University. It is planned to incorporate the modifications which provide reliable operation under processor failure at an early opportunity.

9 Acknowledgments

The authors are most grateful to the UK Science and Engineering Research Council for its support of the development of the Neptune system under grants GR/A 90302 and GR/C 3869. They would also like to acknowledge the helpful comments of the editor and the referees.

10 References

1 BARLOW, R.H., EVANS, D.J., NEWMAN, I.A., and WOODWARD, M.C.: 'A guide to using the NEPTUNE parallel processing system' (Dept. of Computer Studies, Loughborough University, 1981)

2 BROOKES, S.D., HOARE, C.A.R., and ROSCOE, A.W.: 'A theory of communicating sequential processes', J. Assoc. Comput. Mach., 1984, 31, (3), pp. 560-599

3 DIJKSTRA, E.W.: 'Co-operating sequential processes' (Technical University, Eindhoven, Netherlands, 1965), or reprint in GENUYS, F. (Ed.): 'Programming languages' (Academic Press, New York, 1968)

4 HULL, R., HALSALL, F., and GRIMSDALE, R.L.: 'Virtual resource ring: Technique for decentralised resource management in fault-tolerant distributed computer systems', IEE Proc. E, Comput. & Digital Tech., 1984, 131, (2), pp. 38-44

5 LALA, P.K.: 'Fault-tolerant and fault-testable hardware design' (Prentice Hall International, 1984)

6 LAMPORT, L.: 'A new solution of Dijkstra's concurrent programming problem', Commun. ACM, 1974, 17, (8), pp. 453-455

7 MILNER, R.: 'A calculus of communicating systems' in 'Lecture notes in computer science, Vol. 92' (Springer-Verlag, 1980)

8 NEWMAN, I.A., STALLARD, R.P., and WOODWARD, M.C.: 'Combined resource-sharing algorithm', IEE Proc. E, Comput. & Digital Tech., 1984, 131, (2), pp. 55-60

9 NEWMAN, I.A., and WOODWARD, M.C.: 'The reliable sharing of passive resources in a multiprocessor environment', Report no. 45, Dept. of Computer Studies, Loughborough University, 1977

10 RANDELL, B.: 'System structure for fault tolerance', Proc. International Conference on Reliable Software, 1975, pp. 437-449

11 RANDELL, B., LEE, and TRELEAVEN: 'Reliability issues in computing system design', ACM Comput. Surveys, 1978, 10, (2), pp. 123-165

12 WEIHL, W., and LISKOV, B.: 'Implementation of resilient, atomic data types', ACM Trans. Program. Lang. & Syst., 1985, 7, (2), pp. 244-269

13 WOODWARD, M.C.: 'A procedure for the reliable update of data structures in the presence of certain failures', Report no. 46, Dept. of Computer Studies, Loughborough University, 1977

14 WOODWARD, M.C.: 'Algorithms for reliable sharing of resources using a resource master concept', Report no. 47, Dept. of Computer Studies, Loughborough University, 1977


11 Appendix

11.1 Illegal resource states

It is not at all easy to see which of the possible states of the resource control variables can occur during the normal running of the allocation algorithm. Tables 1, 2 and 3 list the states according to the status of the failed processor and the ownership, while the results in Table 4 also give the values of putting and claimer. It can be seen that all the possible values of status and owner can occur but some combinations of putting and claimer cannot. It should be noted that both putting and claimer can change in value after the failure has occurred.

The following tables explain (informally) one way in which some of the less frequent states can arise or why they should never happen. Some of the states require at least three processors in order for the state to be reached. The status value of dontwant normally means that the processor is not involved in resource allocation and the state is dependent on other processors and so is not generally considered. An asterisk ('*') in the column after the value of owner indicates that the state is illegal.

(1) putting = unowned and claimer = false

Status      Owner       Reasoning

dontwant    me          This processor is not involved in a put, so it has been given ownership of the resource when it did not require it; the state is therefore illegal. However, as pointed out in the section on the recovery algorithm, this state can happen if the processor has failed and been recovered, and in fact it occurred frequently during the simulation.

claiming    unowned     Can occur if the resource is unowned and this processor is executing get (Fig. 1 line 81) before entering the barter.

want        unowned*    This is one of the more obviously illegal states, because with claimer not set no processor is in the barter and so this processor would wait passively for an indefinite time until another processor claimed and became the owner.

(2) putting = someone and claimer = false

Status      Owner           Reasoning

needing     someone else    This processor has started a claim (Fig. 1 lines 79 and 47-50) at the same time as the current owner is in put (Fig. 1 lines 71-73 and 51-66).

claiming    someone else    This can occur if the current owner was the successful barterer, cleared claimer and then started a put (Fig. 1 lines 71-74 and 51-66) while this processor was still in the barter (Fig. 1 lines 81-82 and 30-39).

claiming    unowned*        Because putting is set, the last owner is still in put, just after setting the ownership unowned, but this processor is claiming and has therefore been ignored. The wait for put complete procedure (Fig. 1 lines 47-50) ensures this situation should never arise.

want        unowned*        For very similar reasons as for the previous case, this state is illegal, as a processor's claim has been overlooked and the processor might therefore wait indefinitely for ownership of an unused resource.

(3) putting = unowned and claimer = true

Status      Owner           Reasoning

dontwant    me              This is normally an illegal state, as for the case of claimer not set, but it can occur when the recovery algorithm is being used.

needing     me              The processor must have been given ownership while it was in the first few lines of get (Fig. 1 lines 79 and 47-50); another processor must still be in get because claimer is set. This state therefore requires three processors: one to have started a claim, and two to have been bartering, one of which gained ownership and has now given ownership to this processor.

claiming    me              The simplest way of reaching this state is if this processor successfully claimed for an unowned resource and is about to clear claimer (Fig. 1 line 38).

needing     someone else    This processor has just started to get (Fig. 1 lines 79 and 47-50) while the current owner has successfully bartered (Fig. 1 line 38).

want        someone else    Because claimer is set, a previous ownership change involved a barter; this processor found claimer set (Fig. 1 line 81) and so is waiting passively for ownership (Fig. 2 lines 84 and 40-46). The current owner is at Fig. 1 line 38.

needing     unowned         This processor has started a get (Fig. 1 lines 79 and 47-50); because claimer is set, another processor is in the barter and will gain ownership (Fig. 1 lines 30-36).


(4) putting = someone and claimer = true

It is expected to be extremely rare for the two flags to be set in this way. It requires the owner to be in the process of putting (Fig. 1 lines 71-74 and 47-66) and another processor to be still in the process of bartering. This requires two or more processors to take part in the barter (itself unlikely), one of them to go on and become owner while the other waits before line 29 and, after the new owner has cleared claimer (Fig. 1 line 38), executes line 29. The barterer then waits somewhere before line 38 until the owner has started a put.

Status      Owner           Reasoning

dontwant    me              This processor is the owner and is in the process of put within findclaimerorrelease (Fig. 1 lines 73, 53-64).

needing     me              Ownership has just been given to this processor. Requires at least three processors to be actively accessing the resource.

claiming    me              Ownership was given to this processor (Fig. 1 line 58); it could not have taken ownership because putting is set and the resource cannot therefore have been unowned. This processor is performing the barter (Fig. 1 lines 81-82 and 27-39).

want        me              Processor is in get (Fig. 1 line 84) waiting passively for ownership (waitforownership, Fig. 1 lines 40-46) while the current owner is performing a put.

needing     someone else    This processor has started to get (Fig. 1 lines 79 and 47-50), and therefore at least three processors are required to reach this state.

claiming    someone else    This is the processor that has set claimer (Fig. 1 line 29), and is still in the barter (Fig. 1 lines 82 and 30-37). The current owner is in put.

want        someone else    Requires three or more processors, as this processor is waiting passively for ownership (wait for ownership, Fig. 1 lines 84 and 47-50).

dontwant, needing, claiming, want    unowned*    All values of status are illegal because when ownership is unowned and both putting and claimer are set an illegal state has been reached irrespective of this processor's status. Because putting is set, a processor is still in put and has just set the resource unowned, but because claimer is set there are one or more processors in the bartering section of get and one of them should have been given ownership of the resource.
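The asterisked entries in the four tables above can be collected into a single predicate, sketched here in Python. The function name `is_illegal` and the boolean `putting_set` (true when putting = someone) are assumptions made for the example; the sketch encodes only the states marked '*' and does not flag the dontwant/me states, which the tables describe as normally illegal but reachable after a recovery.

```python
def is_illegal(status, owner, putting_set, claimer):
    """Return True for the states marked '*' in the appendix tables.

    status:      "dontwant", "needing", "claiming" or "want"
    owner:       "me", "someone else" or "unowned"
    putting_set: True when putting = someone, False when putting = unowned
    claimer:     value of the claimer flag
    """
    if owner != "unowned":
        return False  # no state with an owner is marked illegal
    if putting_set and claimer:
        return True   # table (4): illegal whatever the status
    if putting_set:
        return status in ("claiming", "want")  # table (2)
    if not claimer:
        return status == "want"  # table (1)
    return False  # table (3): no asterisked entries
```

A simulation could call such a predicate on every snapshot and report a fault if it ever returned true.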
