A Highly Available Lock Manager for HA-NFS

Anupam Bhide, IBM T. J. Watson Research Center
Spencer Shepler, IBM Austin

ABSTRACT

This paper presents the design and implementation of a highly available lock manager for highly available NFS (HA-NFS). HA-NFS provides highly available network file service to NFS clients and can be used by any NFS client without modification. This is provided by having two servers share dual-ported disks so that one server can take over the other's disks and file systems if it fails. Making the NFS service highly available is not enough, since many applications that use NFS also use other services provided with NFS, such as the network lock manager. We describe a scheme whereby each server transfers enough of its lock state to the other so that if it fails, the other server can go through a lock recovery protocol. Our design goal was to make the overhead of transferring the state during failure-free operation as low as possible.

1. Introduction

This paper presents the design and implementation of a highly available lock manager for a Highly Available Network File Server (HA-NFS) [3]. HA-NFS provides tolerance to file server, disk and network failures and can be used by any NFS client. Recovery from server failure is provided by having two servers share access to dual-ported disks and provide backup service for each other. These servers are therefore referred to as twins of each other. However, it is not enough to recover the file server state at a backup server in case of a crash. Most NFS implementations are accompanied by a network lock manager so that clients can obtain locks for files that are remotely mounted. NFS file locking is an extension of local file locking and was designed so that applications can use file locking without having to know whether the file is local or remote. Most NFS implementations support a lockf()/fcntl(), System V [1] style of advisory file and record locking over the network. A number of applications use the network lock manager to synchronize access to shared files and to prevent multiple processes from modifying the same file at the same time.

Since locking is inherently stateful and NFS is supposed to be stateless, the lock manager is implemented separately from NFS. When the primary server fails, the lock state must be recovered at the backup server. This paper will describe a design for recovering the locking state at the backup in case of server failure, and for enabling a failed server which is recovering and re-integrating to regain its locking state.

We have implemented a prototype of the highly available lock manager for HA-NFS on the same platform on which HA-NFS was implemented: a network of workstations and two file servers from the IBM RISC System/6000 family of computing systems running the AIX Version 3 (AIXv3) operating system, connected by either a 10 Mbit/s Ethernet network or a 4 Mbit/s or 16 Mbit/s token ring network. We constructed dual-ported disks from off-the-shelf SCSI disks attached to a SCSI bus that is shared by the two servers. The prototype is operational and has satisfied the design goals.

In section 2, we present background information on HA-NFS. In section 3, we describe the NFS locking protocol. In section 4, we describe various design alternatives to enable lock state to survive processor failure and recovery events. Section 5 describes the design we chose and why. Section 6 describes the re-integration protocol executed when a failed server recovers. Section 7 presents an evaluation of the design from the point of view of implementation effort and performance. Section 8 describes the technique used to recover from media and network failures. Section 9 compares our approach with other approaches, and section 10 proposes some items for future work.

2. The Highly Available Network File System (HA-NFS)

Traditional approaches for providing reliability in networked file systems use server replication. HA-NFS differs from traditional approaches in that it tolerates server failures by using dual-ported disks that are accessible to two servers, each acting as a backup for the other and hence called twins of each other. The disks are divided into two sets, each served by one server during normal operation. Each server maintains on its disks enough information to reconstruct its current volatile state. Since NFS is an almost stateless protocol, the only volatile information is the duplicate cache information that is needed to detect duplicate transmissions.


For example, a "create new file" remote procedure call (RPC) may reach a server and the file create operation may take place, but the acknowledgement to the client could be lost. The client would re-try the RPC and may receive an error, because the file already exists as a result of the previous RPC, unless the RPC was flagged as a re-try. To detect this, the server stores a cache of recently executed RPCs called the "duplicate cache". For further discussion of this topic see [4].

The two servers periodically exchange liveness-checking messages. If one server fails, the failed server's disks will be taken over by its twin server. The twin then reconstructs the lost volatile duplicate cache state using the information on disk. Then the twin impersonates the failed server by taking over its IP address, and operation continues with a potential reduction in performance due to the increased load. The clients on the network are oblivious to the failure and continue to access the file system using the same address. During normal operation, the servers communicate only for periodic liveness-checking. The servers do not maintain any information about each other's volatile state, nor attempt to access each other's disks, during normal (failure-free) operation. HA-NFS adheres to the NFS protocol standard and can be used by existing NFS clients without modification.

HA-NFS is implemented on top of the AIXv3 log-based file system. The AIXv3 file system provides serializable and atomic modification of file system meta-data by using transactional locking and logging techniques. File system meta-data are composed of directories, inodes, and indirect blocks. Every AIXv3 system call that modifies the meta-data does so as a transaction, locking meta-data as they are referenced and recording the changes in a disk log before allowing the meta-data to be written to their "home" locations on disk. In the case of system failure, the meta-data are restored to a consistent state by applying the changes contained in the log. The reliability of ordinary files is ensured by NFS semantics, which require forcing the file data to disk before sending an acknowledgement to the client. The volatile state at an NFS server consists of the duplicate cache; this information is recorded in the disk log so that it can be recovered by the backup server. Further details about the design and the implementation of HA-NFS can be found in [3].
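
To make the duplicate-detection mechanism concrete, the following is a minimal illustrative sketch of a duplicate cache in C; it is not the AIXv3 implementation, which, as described above, additionally records its entries in the file system log. A retried RPC is recognized by its client address and transaction id (xid), and the cached reply is returned instead of re-executing the operation.

    #include <stdint.h>

    #define DUP_CACHE_SIZE 128

    struct dup_entry {
        uint32_t client_addr;    /* client IP address */
        uint32_t xid;            /* RPC transaction id */
        int      cached_status;  /* reply originally sent */
        int      in_use;
    };

    static struct dup_entry dup_cache[DUP_CACHE_SIZE];
    static int dup_next;         /* next slot to recycle (FIFO) */

    /* Return the cached reply for a retried request, or -1 if unseen. */
    int dup_cache_lookup(uint32_t client_addr, uint32_t xid)
    {
        for (int i = 0; i < DUP_CACHE_SIZE; i++)
            if (dup_cache[i].in_use &&
                dup_cache[i].client_addr == client_addr &&
                dup_cache[i].xid == xid)
                return dup_cache[i].cached_status;
        return -1;
    }

    /* Record a newly executed request and the reply that was sent. */
    void dup_cache_insert(uint32_t client_addr, uint32_t xid, int status)
    {
        struct dup_entry *e = &dup_cache[dup_next];
        dup_next = (dup_next + 1) % DUP_CACHE_SIZE;
        e->client_addr = client_addr;
        e->xid = xid;
        e->cached_status = status;
        e->in_use = 1;
    }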

3. The NFS Locking Protocol

In this section, we will provide an overview of the NFS/ONC locking protocol. The locking protocol is implemented outside of the NFS protocol, because the NFS locking protocol is stateful and NFS is designed to be stateless. In most implementations the file locking protocol is actually implemented in two daemons.

The daemons are usually named rpc.lockd and rpc.statd, and these are the names used in AIXv3. The rpc.lockd daemon at a server handles locking requests for NFS clients which are accessing files at the server. The rpc.lockd daemon acts as a surrogate at the server for client processes and keeps track of what locks are held by clients at any one point in time. At a client, the rpc.lockd keeps track of what locks are held at the NFS server by the various processes.

The second daemon, the rpc.statd, at a server keeps a list of clients that are to be tracked for system failure. Similarly, at a client this daemon keeps track of what remote servers are currently being accessed by file lock requests.

Getting a Lock

When an application at a client makes a system call requesting a lock on an NFS-mounted file, the client kernel makes an RPC to the client's rpc.lockd. This rpc.lockd then sends the lock request to the rpc.lockd at the server, which makes a lock request to the server's kernel. The server's kernel accepts the lock request and returns the appropriate response to the server's rpc.lockd. The server's rpc.lockd will then respond to the client's rpc.lockd with the result. It in turn will respond to the client's kernel, which will then return the response to the application.

When the client's rpc.lockd receives the original lock request from the client kernel, it will register the server's hostname with the client's rpc.statd. This is done before the lock request is sent to the server's rpc.lockd to be processed. Upon receiving the lock request from the client, the rpc.lockd at the server will register the host name of the client with the rpc.statd at the server. The rpc.statds on both the client and server record the host names on disk so that they can be accessed after a failure. This registration process is done for the first lock request only. The rpc.lockd keeps internal state about which hosts it has registered with the rpc.statd, so this initial registration step is skipped on subsequent lock requests.

Recovery Actions Upon Client/Server Failure

In a standard NFS (not HA) server configuration, there is a method to rebuild the locking state that is kept by the NFS server. Rebuilding of state occurs only after server failures. It is not needed after client failure and recovery, since the applications that took the locks no longer exist after the client's system failure. The only action that needs to be taken when a failed client recovers is to tell relevant servers to release the locks that are held on the client's behalf.

The following explains the corresponding steps that the NFS server rpc.lockd/rpc.statd follow to recover locking state. The rpc.statd is started before the rpc.lockd during system initialization. When the rpc.statd on a server restarts, assuming system failure, it reads from disk the names of systems it was monitoring during its previous incarnation.


The rpc.statd then informs each of the rpc.statds on these client systems about the server's failure. Client rpc.statds then inform their rpc.lockds about the server failure.

In the case where the rpc.lockd process on the server has failed, the rpc.lockd process during initialization will inform the local rpc.statd that it had failed, and it goes into a grace period in which it accepts only lock reclaim requests from clients. When a client rpc.lockd is informed of server failure, it goes through its lock table and re-requests or reclaims all locks it had held at that server. After all clients go through this protocol, the server has regained the lock state that was held before the failure.

If the client system fails, the rpc.statd at the client will notify the servers of the client failure. The list of servers is built from the list of monitored servers that the rpc.statd was keeping in the file system of the client. Upon notification of client system failure, the server's rpc.lockd will release all of the file locks that were held by that client.
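
The client-side reclaim pass can be sketched as follows; the lock-table type and the send_lock_rpc() stub are hypothetical stand-ins, not the actual rpc.lockd interfaces. The client walks its table of granted locks and re-requests, with the reclaim flag set, every lock held at the failed server.

    #include <string.h>

    struct client_lock {
        struct client_lock *next;
        char server[256];        /* host the lock is held at */
        char path[1024];         /* file the lock covers */
        long offset, length;     /* byte range of the lock */
    };

    extern struct client_lock *lock_table;   /* all locks this client holds */
    extern int send_lock_rpc(const char *server, const struct client_lock *l,
                             int reclaim);   /* hypothetical RPC stub */

    /* Called when rpc.statd reports that failed_server has restarted. */
    void reclaim_locks(const char *failed_server)
    {
        for (struct client_lock *l = lock_table; l != NULL; l = l->next) {
            if (strcmp(l->server, failed_server) != 0)
                continue;        /* lock held at another server; leave it */
            /* reclaim=1: only such requests are honored during the
             * server's grace period. */
            if (send_lock_rpc(failed_server, l, 1) != 0) {
                /* A failed reclaim means the lock is lost; the real
                 * daemon would report this to the owning process. */
            }
        }
    }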

4. Design Alternatives

To design a highly available lock manager for HA-NFS, a way must be found to transfer the locking state held by the primary HA-NFS server to the backup or twin HA-NFS server. This needs to be done so that correctness can be maintained in the operation of the NFS server from the client's perspective.

The first approach that might be taken is to follow the same general scheme for transferring the lock state that the HA-NFS server uses to transfer file system state and duplicate cache entries to the twin HA-NFS server. Recall that duplicate cache entries are used to detect request re-transmissions that occur when acknowledgements get lost. The HA-NFS server stores the duplicate cache entry in the file system log when the duplicate entry is initially added to the duplicate cache table. This works because of the one-to-one mapping between a duplicate cache entry and the commit of the entry to the AIX Journaled File System (JFS) log. For this same mechanism to work for NFS locking, there would need to be a mapping between the lock/unlock operations of the client and JFS logging commit points, which correspond to meta-data modification points.

This mapping does not exist and would not be possible without a redesign of the logging services to serve other than normal JFS activity. It would also mean that locking operations would run at disk speed.

The second approach could be to have the rpc.lockd of each HA-NFS server transfer the locking state to the twin HA-NFS server. This would need to be done with each positive response to a client's locking request. Before the primary server's rpc.lockd sends its positive (granted) response to the client, it would have to call the rpc.lockd on the twin HA-NFS server. The rpc.lockd on the twin could then build the same locking state as the primary server. This approach would have a performance impact on each locking operation. It would also affect the twin's locking performance and general system performance, since the twin would be fielding the same locking requests that the primary was handling. Another drawback to this approach would be the added complexity of keeping locking state for the twin and differentiating that locking state from the local locking state of the twin.

The third approach is similar to the second, except that instead of all positive lock/unlock responses being passed to the twin, host names of new clients are passed to the twin on a client's first lock request. Thus, the rpc.statd would be the one passing state information to the twin's rpc.statd. Recall that the rpc.statd is contacted by the rpc.lockd on the first lock request of a client. The rpc.statd is told to monitor that client. The rpc.statd in turn creates a file in the directory /etc/sm whose name matches the host name of the client making the request. With this procedure the rpc.statd can recover the list of clients that were being monitored before a failure. After the rpc.statd has placed the file named after the client in the /etc/sm directory, it will contact the twin's rpc.statd. The twin's rpc.statd also places an entry in the /etc/sm directory that corresponds to the primary server's entry.

With the corresponding /etc/sm entries in place on the HA-NFS twin, all it has to do during takeover is to go through the lock recovery protocol, playing the role of a recovering server for both itself and its twin. The clients of the failed server and those of the twin will execute lock recovery, and the twin will effectively rebuild the locking state held by the primary server.
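
As a concrete illustration of this first-registration path, here is a minimal sketch of the monitor routine in the modified rpc.statd; the notify_twin_statd() RPC stub is hypothetical, and the real daemon differs in detail.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    extern int notify_twin_statd(const char *host);  /* hypothetical RPC stub */

    int monitor_host(const char *host)
    {
        char path[1024];
        int fd;

        /* Record the host on disk so the list survives our own failure. */
        snprintf(path, sizeof(path), "/etc/sm/%s", host);
        fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return -1;
        close(fd);

        /* Forward the host name to the twin's rpc.statd and wait for its
         * acknowledgement before answering the local rpc.lockd; this wait
         * is the source of the first-lock latency measured in Table 3. */
        return notify_twin_statd(host);
    }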

5. Our Design

The third design alternative involves the least overhead during normal failure-free operation. Tables 1 and 2 show the details of this protocol. Table 1 shows how a lock is obtained. Table 2 explains how locking state is re-established at the backup after a server fails. This design requires that the rpc.lockd be stopped and restarted during the takeover process to force the lock recovery to occur.


One disadvantage of this scheme is that locking state has to be rebuilt, instead of being already available as in the second scheme or recovered from the log as in the first scheme. We considered this trade-off acceptable, since we were getting better performance in the normal case in exchange for sacrificing some performance during recovery from failure.

For simplicity in the chosen design, we decided to let the rpc.lockd reclaim lock state for itself in addition to the failed server. It would be possible to modify the rpc.lockd so that it would not drop its own locking state during recovery of the twin HA-NFS server's locking state.

o An application requests a lock on a file that resides in an NFS file system.
o The NFS client's kernel makes an RPC to the client's rpc.lockd requesting the lock.
o If this is the first lock request for the server, the rpc.lockd on the client registers the server's host name with the rpc.statd on the client.
o The client's rpc.lockd sends the lock request to the server's rpc.lockd.
o If the lock request is the first one received from this particular client, the rpc.lockd registers the client's host name with the rpc.statd on the server.
o The server's rpc.statd then informs its twin rpc.statd on the backup server that the client should be monitored, and then sends an acknowledgement to the rpc.lockd.
o The server's rpc.lockd makes the lock request to the server's kernel.
o The server's kernel accepts the lock request and validates it, returning a response to the server's rpc.lockd.
o The server's rpc.lockd responds to the client's rpc.lockd.
o The rpc.lockd on the client responds to the NFS client's kernel with the lock response.
o The application is given the answer to its lock request.

Table 1: Getting a lock on a remote file in HA-NFS

The rpc.statd on a system is informed of the name of its twin host through an RPC call by an HA-NFS daemon when the HA-NFS subsystem is started. Once this is done, the rpc.statd will contact the twin's rpc.statd every time it is called by the local rpc.lockd with a new host to be monitored. The twin's rpc.statd will create an entry in the /etc/sm directory with the host name specified by the primary's rpc.statd and respond to the monitoring request.

If a server fails, the HA-NFS daemons running on its twin will detect the failure and take over its disks. They will then replay the log, bring the file systems to a consistent state, and mount them on the appropriate directory.

Finally, they will take over the IP address of the failed server on a spare network interface provided for this purpose; the network interface may be either Ethernet or token ring. The rpc.lockd is then restarted. Because of the restart, the rpc.lockd and rpc.statd will go through the lock recovery protocol, sending out requests to clients of both this server and its failed twin to perform lock reclaim actions. The clients will see a simultaneous failure of both twin servers and send out reclaim requests to both. All these requests will be received by the operational twin (since it is responding to both IP addresses) and the correct locking state will be recovered.

o The failure of the twin server is detected by the backup HA-NFS server. The backup server takes over the disks, brings the file systems to a consistent state and rebuilds the duplicate cache of the failed server. The rpc.lockd is stopped to prevent requests from being processed until takeover is complete. The backup then takes over the IP address of the failed server and starts to provide NFS file service.
o The rpc.lockd is restarted at the backup server. When it starts, it contacts the server's rpc.statd to tell it of its failure.
o Upon receiving the failure notification from the rpc.lockd, the rpc.statd at the backup server contacts each of the clients (both its own clients as well as those of the failed twin) that were being monitored because of the current locking state. This notification lets each of the clients know that the server's rpc.lockd has failed. The rpc.statd notifies each of the clients playing the role of the appropriate server.
o After the rpc.lockd notifies the rpc.statd of its failure, it goes into a grace period for lock recovery. This allows clients to reclaim locks that they held before the failure of the twin server's rpc.lockd. The default grace period is 45 seconds.
o When the client's rpc.statd receives the notification that the server has failed, it "calls back" to the rpc.lockd on the client to notify it of the failure of the server.
o When notified, the client's rpc.lockd will send reclaim requests for all locks currently being held that are from the failed server. These reclaim requests will be honored at the server. As a result the server will regain the locking state that was held prior to its failure.
o During the grace period on the server, the rpc.lockd will only honor reclaim requests from clients. This assures consistency for the clients that held locks prior to the failure. The locks will be held again when the server restarts normal lock service. Regular locking requests that the server's rpc.lockd receives will be returned to the requesting client with a message that the client should retry the lock request. After the grace period elapses, new lock requests start being honored.

Table 2: Lock Recovery After Server Failure in HA-NFS
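
The grace-period rule in the last steps of Table 2 amounts to a simple check when the server's rpc.lockd dispatches a request. The sketch below is illustrative; the status names and types are placeholders, not the actual lock manager protocol.

    #include <time.h>

    #define GRACE_PERIOD 45          /* default, in seconds */

    enum lock_status { LOCK_GRANTED, LOCK_DENIED, LOCK_RETRY_LATER };

    static time_t restart_time;      /* recorded when rpc.lockd starts */

    extern enum lock_status grant_lock(const void *req);  /* hypothetical */

    void lockd_started(void)
    {
        restart_time = time(NULL);
    }

    enum lock_status handle_lock_request(const void *req, int is_reclaim)
    {
        int in_grace = (time(NULL) - restart_time) < GRACE_PERIOD;

        if (in_grace && !is_reclaim)
            return LOCK_RETRY_LATER; /* client told to retry after grace */
        if (!in_grace && is_reclaim)
            return LOCK_DENIED;      /* too late to reclaim */
        return grant_lock(req);      /* ordinary grant or honored reclaim */
    }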


Another detail of the rpc.statd modifications deals with contacting the clients upon server failure. The clients are notified of the server failure by receipt of an RPC. Within the parameters of the RPC is the host name of the server that has failed. The rpc.statd at the client uses that host name to check if it is among the list that is currently being monitored. If so, the rpc.statd then sends an RPC to the rpc.lockd at the client signifying the server failure. A simple solution would be for the twin rpc.lockd, upon takeover, to contact each client with the name of the failed server as well as its own host name. To avoid this overhead, the file with the client's name in the /etc/sm directory indicates whether it is this server that needs to monitor the client or its twin.

A call to unmonitor a host needs to be supported for the twin-registration process. This is needed so that when a primary's rpc.lockd decides it no longer needs to monitor a client, the rpc.statds on both the primary and twin systems will remove the monitoring entry from the /etc/sm directory. When the rpc.statd on the primary HA-NFS server receives an unmonitor request from the rpc.lockd, it will remove its /etc/sm entry for that host. The rpc.statd will then call the twin's rpc.statd with the request to remove the host from its monitoring tables. The rpc.statd on the twin will decrement the reference count of that monitored host. If the reference count is zero, it will remove its entry from the /etc/sm directory.
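
The unmonitor path with twin reference counting can be sketched as follows; the RPC stub and the reference-count helper are hypothetical, and the real daemons differ in detail.

    #include <stdio.h>
    #include <unistd.h>

    extern int twin_unmonitor_rpc(const char *host);  /* hypothetical RPC stub */
    extern int refcount_decrement(const char *host);  /* returns new count */

    /* On the primary: rpc.lockd has dropped its last lock for this host. */
    void unmonitor_host(const char *host)
    {
        char path[1024];
        snprintf(path, sizeof(path), "/etc/sm/%s", host);
        unlink(path);                /* stop monitoring locally */
        twin_unmonitor_rpc(host);    /* propagate to the twin's rpc.statd */
    }

    /* On the twin: handle an incoming unmonitor request.  The same client
     * may be monitored on behalf of both servers, so the entry is removed
     * only when no server needs it any longer. */
    void twin_unmonitor_handler(const char *host)
    {
        if (refcount_decrement(host) == 0) {
            char path[1024];
            snprintf(path, sizeof(path), "/etc/sm/%s", host);
            unlink(path);
        }
    }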

6. Reintegration

When a failed server recovers, the HA-NFS daemon at that server will recover the duplicate cache state from its twin after taking over the file systems.

The failed server will also receive the rpc.statd state from the twin. The mechanism to handle this is achieved through the twin registration process at the twin. After the duplicate cache state is transferred to the recovering server, the host name of the recovering server is registered with the twin's rpc.statd. By design, the rpc.statd will transfer to the registered server the full list of host names that it is currently monitoring. This allows the recovering server to obtain the current list of clients that are being monitored.

After receiving the list of clients, the recovering server then restarts the rpc.lockd.

Through its normal recovery process, each of the clients in the transferred monitoring list is contacted and told of the failure of the recovering server. The rpc.statd at the notified client will then follow the normal lock recovery process by contacting the client's rpc.lockd, allowing it to reclaim the client's locks.

The twin at this point will restart its rpc.lockd, and it will also have its locking state rebuilt when the clients reclaim their locks. The rpc.lockd on the twin was originally stopped so that the file systems of the recovering system could be unmounted. When the rpc.lockd exits gracefully, the locks that it holds are released from the file system, thereby freeing the file systems of reference counts that may prevent them from being unmounted.

7. Evaluation

The effort it took to implement the chosen design was minimal. A simple RPC program was designed to handle the twin registration and the transfer of host monitoring requests between rpc.statds. The rpc.statd code was structured in such a way that the extra logic required to implement our design was small. Most of our effort was spent on understanding the design and implementation of the rpc.lockd and rpc.statd prior to modification. After the design of these two daemons was understood, it was straightforward to design and implement the chosen method.

There are three areas where the performance of a HA-NFS server and its clients will be affected by our design for the highly available lock manager. This performance penalty is in comparison to a standard NFS server and the standard implementation of the network lock manager protocol.

1. The first performance penalty is taken when a given client makes the very first lock request of the HA-NFS server. Contacting the twin server for the monitoring of the client will add extra delay in responding to the client. No penalty is incurred for subsequent lock requests.

2. The second penalty will be paid when a twin fails and the twin that takes over its identity starts to process the lock requests of its own clients and the clients of the failed twin.

3. The third penalty is incurred when the reintegration of the failed twin occurs. The twin server that has taken over must transfer the monitoring state to the recovering twin. In this case the implementation has the rpc.statd forking a child that handles the transfer of the monitoring state.

Because the first two penalties are more important, they will be discussed in further detail.

Penalty For The First Lock Request

In both the standard NFS and our lock manager designs, the first time a client makes a lock request, the rpc.lockd contacts the rpc.statd and asks that the client's name be placed in the /etc/sm directory.


This is done by creating a file whose name corresponds to the host name of the client. In our design, the rpc.statd at this point of first registration will also contact the twin's rpc.statd and have it monitor the same client. The rpc.statd makes this RPC to the twin's rpc.statd and waits for a response before responding to its rpc.lockd monitoring request. This means that the rpc.lockd will not be able to continue processing the lock request of the client until the rpc.statd at the twin finishes creating its file that corresponds to the initiating twin.

This extra overhead of contacting the twin's rpc.statd will obviously delay the response to the first lock request of the client. Table 3 illustrates the impact of this delay. A configuration of RISC System/6000s was used to measure what the cost of this first lock request would be in a HA-NFS environment. Three systems were used for these measurements. They were running AIXv3 with the HA-NFS subsystem installed and configured. The rpc.lockd and rpc.statd had been modified to follow our design. The three systems were isolated on a 16 Mbit/s token ring. The test case that was executed reset the systems so that no previous rpc.statd state was held at any system (client or server). The test case at the client mounted one of the HA-NFS exported file systems from the HA-NFS server pair. The test case then noted the current time and issued a system call to obtain a lock on a file in the NFS-mounted directory. After the lock system call returned, the current time was again noted and the elapsed time measured. This is reported on the row labelled "First lock".
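
The measurement reduces to timing a single fcntl() lock on an NFS-mounted file. Below is a minimal sketch of such a test program; the mount point and file name are placeholders, not the actual test harness. Run against freshly reset rpc.statd state it times the "First lock" case; run again without a reset it times the "Second lock" case.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval t0, t1;
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 0 };  /* whole file */

        /* Placeholder path for a file in an HA-NFS-mounted file system. */
        int fd = open("/mnt/hanfs/testfile", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        gettimeofday(&t0, NULL);
        if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("fcntl"); return 1; }
        gettimeofday(&t1, NULL);

        long us = (t1.tv_sec - t0.tv_sec) * 1000000L
                + (t1.tv_usec - t0.tv_usec);
        printf("lock took %.2f ms\n", us / 1000.0);

        fl.l_type = F_UNLCK;         /* release the lock for the next run */
        fcntl(fd, F_SETLK, &fl);
        close(fd);
        return 0;
    }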

The test case went on to do the same sequence of lock and unlock requests a second time. This second iteration, however, did not reset the rpc.statd state. Therefore the overhead of obtaining a second lock was measured. This is reported on the row labelled "Second lock".

This test case was also executed with a standard NFS server for comparison. The same configuration described above was used, except that it was just one standard NFS server and one client. This time the rpc.lockd and rpc.statd were running the standard algorithm. Again the rpc.statd state on both client and server was removed and the test case executed. The second lock request was also executed as in the scenario described above.

Both of these lock test scenarios were executed 50 times, and an average response time for the lock request and a standard deviation were calculated. These are the results reported in Table 3.

The overhead of contacting the twin server to have the rpc.statd monitor the client almost doubles the response time for the very first lock request. We felt that this cost was reasonable given that it occurred only for the first lock request from a particular client. The numbers for the "second lock" request with and without HA-NFS are the same within the limits of experimental error.

The same was done for the unlock request, and again the elapsed time is reported. Table 4 shows the measurements for the unlock requests. This data shows that HA-NFS unlock requests do not suffer from any extra overhead.

                Without HA-NFS           With HA-NFS
                Lock Oprn    Std. Dev.   Lock Oprn    Std. Dev.
First Lock      132.12 ms    13.28 ms    245.06 ms    70.18 ms
Second Lock      15.66 ms     0.98 ms     16.08 ms     1.20 ms

Table 3: Highly Available Lock Manager Overheads for locks

                Without HA-NFS           With HA-NFS
                Unlock Oprn  Std. Dev.   Unlock Oprn  Std. Dev.
First Unlock    14.30 ms     0.41 ms     14.42 ms     0.80 ms
Second Unlock   14.43 ms     0.47 ms     14.63 ms     0.60 ms

Table 4: Highly Available Lock Manager Overheads for unlocks

                Without HA-NFS            With HA-NFS
                Lock Oprn   Unlock Oprn   Lock Oprn   Unlock Oprn
First Run       14.80 ms    13.89 ms      15.19 ms    13.97 ms
Second Run      14.82 ms    13.85 ms      14.91 ms    13.90 ms
Third Run       14.80 ms    13.89 ms      15.17 ms    13.98 ms

Table 5: Highly Available Lock Manager Overheads (500 operations)


We believe that the typical mode in which clients use servers in most applications is that a client will tend to get many locks on a particular server in a given period. This would tend to wash out the effect of the higher HA-NFS first-lock overhead as compared to the standard NFS lock manager. Table 5 shows a test case that executed 500 sequential lock operations in the same configuration specified above (unlock operations were also measured). The results reported are the per-request elapsed time.

Overhead Of Handling The Failure Of A Twin

The steps that are taken when a server fails and its twin takes over operation for the failed server have been enumerated in Table 2. Remember that one of the steps is to stop the rpc.lockd while the takeover configuration is executing on the twin. Once the takeover is complete, the rpc.lockd is restarted. At this point, the rpc.lockd will notify the rpc.statd of its failure and the rpc.statd will execute its lock recovery algorithm.

With our design, the rpc.statd will have to contact each of the clients twice. One RPC will contain the twin's host name and the second RPC will contain the host name of the failed server. In this way the clients will be notified of both servers' failure and they will reclaim the locks held at the server pair. With our design the rpc.statd has exactly twice the normal number of RPCs to execute. This number may also include clients that held locks at the failed server and not at the twin that has taken over.

Since this notification mechanism is asynchronous to the other operations occurring at the server, it should not be a significant burden for the twin.

Also, the twin will have to handle its own incoming reclaim requests and the reclaim requests for the failed server. This load is difficult to determine, since it depends on the type of applications that are being executed at the clients and their locking behavior. Therefore, under heavy stress the default grace period that the rpc.lockd uses for lock reclaims may not be sufficient for correct operation. This is also true of a standard NFS server, but it is made worse by the fact that the twin will also be handling the failed server's lock requests.

In the testing that has been done with this implementation, the recovery of client locking state was achieved in a reasonable amount of time. The majority of this testing was done under light to medium locking stress. Since the rpc.lockd has a relatively short default time for its grace period, the recovery process that the clients are forced to go through may fail under a very heavy load. If this happens, the grace period can be increased by the system administrator to handle this case.

It should be mentioned that the lock response time will not be the only thing that will suffer when a server fails. The normal NFS requests that the twin receives will also be affected by the extra workload that it has taken on as part of the impersonation of the failed server.

8. Network and Media Failures

HA-NFS provides recovery from disk and network failures for the file server as described in [3]. The same methods can ensure that the lock manager can recover from these failures.

Fast recovery from disk failures is achieved in HA-NFS by mirroring files on different disks. However, all copies of the same file are on disks that are controlled by the same file server, eliminating the overhead of ensuring consistency and coherence between the two servers that would otherwise occur. Since disk failures are not frequent, mirroring is only used for applications that require continuous availability. Otherwise, archival backups could be used to recover from disk failures. The files used by the rpc.statd could be mirrored to provide high availability.

Network failures are tolerated by optional replication of the network components, including the transmission medium. However, packets are not replicated over the two networks. Instead, the network load is distributed over the networks. Clients detect network failure because of loss of heartbeat from servers and switch over to the second network by changing their routing tables. Also, a message is sent to the server to change its routing tables. This mechanism works for all messages between client and server, including the locking protocol messages.

9. Comparison with Other Systems

Tandem's NonStop architecture [2] [5] uses special-purpose hardware in the form of dual-ported disk controllers which allow each disk to be attached to two processors. If a single processor fails, the other takes over the disks and provides processes that were using these disks with continued access. However, Tandem has the concept of process-pairs: each I/O process has a twin to which it continuously checkpoints its state. This ensures that the backup I/O process knows what operations are needed to bring the disk to a consistent state when it takes over. On the other hand, HA-NFS has no such checkpointing overhead during normal (failure-free) operation. The information for bringing the disks to a consistent state is stored on the disk itself by treating each NFS client-to-server RPC as a transaction and writing a log. Thus, there is a significant difference between the HA-NFS and Tandem approaches. Presumably in the Tandem approach the same process-pair/checkpointing approach is used to transfer locking state from one process to its backup twin. In our approach, the rpc.statd communicates with its twin rpc.statd only when a new client makes its first lock request. No communication is required for subsequent lock requests.


VAXcluster [6] also has a distributed lock manager which recovers locks after a processor fails. Upon being notified of node failure, the lock manager on each node must perform recovery actions before normal cluster operation continues.

First, each lock manager deallocates all locks acquired on behalf of other processors. Only local locks are retained. Next, each lock manager re-acquires each lock it had before the failure. The net result is to deallocate all locks owned by the failed node. However, note that this requires all locks to be re-acquired on any failure. In NFS and HA-NFS, clients that held locks at a failed server node need to re-acquire only those locks that were held at the failed node. The trade-off is that the first lock request is slower in HA-NFS.

10. Future Work

We used a simple design where the twin of a failed server rebuilds its own locking state along with the locking state of the failed server. The load on this server would decrease if only the failed server's locking state were to be selectively rebuilt. This can be done by having the rpc.statd keep a more detailed record of which clients were monitored by which server. This way only the clients affected by a takeover would be notified of the server failure.

The other part of the design that might be extended has to do with the grace period that the rpc.lockd uses. The grace period is used to allow the clients to rebuild their locking state after a server failure. This grace period in the implementation has a default of 45 seconds. As mentioned earlier, this default seems to work well with the test cases used, but it may not be sufficient when the load of the server increases. There is a possibility that the grace period could be decided dynamically, based on the number of clients that respond to the failure message that the rpc.statd supplies. The rpc.statd could possibly keep track of the percentage of clients that have contacted the server after being notified of the failure. Once a certain percentage has been reached, the rpc.statd could then notify the local rpc.lockd so that it can make a decision to either continue the grace period or extend it. It is possible that clients do not need to contact the server after failure of the rpc.lockd, so the percentage to use in determining validity of the grace period would not be simple to choose.

These future work items could possibly decrease the work load of the server and increase the likelihood of correct and timely reclamation of locking state.
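
A sketch of this proposed dynamic grace period is given below. It is entirely hypothetical, since this is future work rather than part of the implementation; the counters and threshold are illustrative.

    static int clients_notified;    /* clients told of the failure */
    static int clients_responded;   /* clients heard from since then */

    #define RESPONSE_THRESHOLD 0.90 /* fraction considered "enough" */

    /* Called by rpc.statd for each failure notification it sends. */
    void client_notified(void)  { clients_notified++; }

    /* Called when a notified client re-contacts the server. */
    void client_responded(void) { clients_responded++; }

    /* rpc.lockd could poll this to decide whether the grace period
     * still needs to run.  As noted above, some clients held no locks
     * and will never call back, so picking the threshold is not simple. */
    int grace_threshold_reached(void)
    {
        if (clients_notified == 0)
            return 1;               /* nobody to wait for */
        return (double)clients_responded / clients_notified
               >= RESPONSE_THRESHOLD;
    }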

Bibliography

[1] AT&T System V Interface Definition.

[2] Joel Bartlett, A NonStop Kernel. In Proceedings of the Eighth Symposium on Operating Systems Principles, Vol. 15, No. 5, Dec 1981.

[3] Anupam Bhide, Elmootaz Elnozahy and Stephen Morgan, A Highly Available Network File Server. In Proceedings of the Winter 1991 USENIX Conference.

[4] C. Juszczak, Improving the Performance and Correctness of an NFS Server. In Proceedings of the 1988 USENIX Conference.

[5] J. Katzman, A Fault-Tolerant Computing System. In Proceedings of the Eleventh Hawaii International Conference on System Sciences, Jan 1978.

[6] N. Kronenberg, H. Levy and W. Strecker, VAXclusters: A Closely-Coupled Distributed System. In ACM Transactions on Computer Systems, Vol. 4, No. 2, May 1986.

Author Information

Anupam Bhide is a research staff member at the IBM T. J. Watson Research Center. His current research interests include database systems, operating systems, fault-tolerance and racquetball. In addition to his work on fault-tolerance in network file systems, he has worked on fault-tolerance in parallel database machines and high performance transaction processing. He graduated with a Ph.D. from the University of California-Berkeley in 1988. Previously, he earned a B.Tech. from I.I.T.-Bombay and an M.S. from the University of Wisconsin-Madison. He can be reached at [email protected].

Spencer Shepler is currently a software engineer for IBM in Austin, Texas. His interests include operating systems and distributed systems, specifically distributed file systems. Spencer graduated in 1989 from Purdue University with a Master of Science degree in Computer Science. Since that time, he has been working with and is partly responsible for NFS and its implementation in the AIX V3 operating system. Reach him electronically at [email protected].
