Carnegie Mellon University has developed an exciting file system. Mr. Braam, one of the developers, tells us all about it.

by Peter J. Braam

LINUX JOURNAL, JUNE 1998

The Coda distributed file system is a state-of-the-art experimental file system developed in the group of M. Satyanarayanan at Carnegie Mellon University (CMU). Numerous people contributed to Coda, which now incorporates many features not found in other systems:

1. Mobile computing:
   * disconnected operation for mobile clients
   * reintegration of data from disconnected clients
   * bandwidth adaptation
2. Failure resilience:
   * read/write replication servers
   * resolution of server/server conflicts
   * handles network failures which partition the servers
   * handles disconnection of clients
3. Performance and scalability:
   * client-side persistent caching of files, directories and attributes for high performance
   * write-back caching
4. Security:
   * Kerberos-like authentication
   * access control lists (ACLs)
5. Well defined semantics of sharing
6. Freely available source code

Distributed File Systems

A distributed file system stores files on one or more computers called servers and makes them accessible to other computers called clients, where they appear as normal files. There are several advantages to using file servers: the files are more widely available since many computers can access the servers, and sharing the files from a single location is easier than distributing copies of files to individual clients. Backups and safety of the information are easier to arrange since only the servers need to be backed up. The servers can provide large storage space, which might be costly or impractical to supply to every client. The usefulness of a distributed file system becomes clear when considering a group of employees sharing documents; however, more is possible. For example, sharing application software is an equally good candidate. In both cases, system administration becomes easier.

There are many problems facing the design of a distributed file system. Transporting many files over the network can easily create sluggish performance and latency; network bottlenecks and server overload can result. The security of data is another important issue: how can we be sure that a client is really authorized to have access to information, and how can we prevent data being sniffed off the network? Two further problems facing the design are related to failures. Often, client computers are more reliable than the network connecting them, and network failures can render a client useless. Similarly, a server failure can be very unpleasant, since it can disable all clients from accessing crucial information. The Coda project has paid attention to many of these issues and implemented them as a research prototype.

Figure 2. Servers Control Security (Illustration by Gaich Muramatsu)

Coda was originally implemented on Mach 2.6 and has recently been ported to Linux, NetBSD and FreeBSD. Michael Callahan ported a large portion of Coda to Windows 95, and we are studying Windows NT to understand the feasibility of porting Coda to NT. Currently, our efforts are on ports and on making the system more robust. A few new features are being implemented (write-back caching and cells, for example), and in several areas, components of Coda are being reorganized. We have already received very generous help from users on the Net, and we hope that this will continue. Perhaps Coda can become a popular, widely used and freely available distributed file system.

Coda on a Client

If Coda is running on a client, which we shall take to be a Linux workstation, typing mount will show a file system of type Coda mounted under /coda. All the files, which any of the servers may provide to the client, are available under this directory, and all clients see the same name space. A client connects to Coda and not to individual servers, which come into play invisibly. This is quite different from mounting NFS file systems, which is done on a per server, per export basis. In the most common Windows systems (Novell and Microsoft's CIFS) as well as with Appleshare on the Macintosh, files are also mounted per volume. Yet the global name space is not new.
The Andrew file system, Coda's predecessor, pioneered the idea and stored all files under /afs. Similarly, the distributed file system DFS/DCE from OSF mounts its files under one directory. Microsoft's new distributed file system (dfs) provides glue to put all server shares in a single file tree, similar to the glue provided by auto-mount daemons and yellow pages on UNIX. Why is a single mount point advantageous? It means that all clients can be configured identically, and users will always see the same file tree. For large installations this is essential. With NFS, the client needs an up-to-date list of servers and their exported directories in /etc/fstab, while in Coda a client merely needs to know where to find the Coda root directory /coda. When new servers or shares are added, the client will discover these automatically in the /coda tree.

To understand how Coda can operate when the network connections to the server have been severed, let's analyze a simple file system operation. Suppose we type:

cat /coda/tmp/foo

to display the contents of a Coda file. What actually happens? The cat program will make a few system calls in relation to the file. A system call is an operation through which a program asks the kernel for service. For example, when opening the file the kernel will want to do a lookup operation to find the inode of the file and return a file handle associated with the file to the program. The inode contains the information to access the data in the file and is used by the kernel; the file handle is for the opening program. The open call enters the virtual file system (VFS) in the kernel, and when it is realized that the request is for a file in the /coda file system, it is handed to the Coda file system module in the kernel. Coda is a fairly minimalistic file system module: it keeps a cache of recently answered requests from the VFS, but otherwise passes the request on to the Coda cache manager, called Venus. Venus will check the client disk cache for tmp/foo, and in case of a cache miss, it contacts the servers to ask for tmp/foo. When the file has been located, Venus responds to the kernel, which in turn returns the calling program from the system call.

Schematically we have the image shown in Figure 3.

Figure 3. Client/Venus/Vice

The figure shows how a user program asks for service from the kernel through a system call. The kernel passes it up to Venus, by allowing Venus to read the request from the character device /dev/cfs0. Venus tries to answer the request by looking in its cache, asking servers, or possibly by declaring disconnection and servicing it in disconnected mode. Disconnected mode kicks in when there is no network connection to any server which has the files. Typically this happens for laptops when taken off the network or during network failures. If servers fail, disconnected operation can also come into action.

When the kernel passes the open request to Venus for the first time, Venus fetches the entire file from the servers, using remote procedure calls to reach the servers. It then stores the file as a container file in the cache area (currently /usr/coda/venus.cache/). The file is now an ordinary file on the local disk, and read/write operations to the file do not reach Venus but are (almost) entirely handled by the local file system (EXT2 for Linux). Coda read/write operations take place at the same speed as those to local files. If the file is opened a second time, it will not be fetched from the servers again, but the local copy will be available for use immediately. Directory files (remember, a directory is just a file) as well as all the attributes (ownership, permissions and size) are all cached by Venus, and Venus allows operations to proceed without contacting the server if the files are present in the cache. If the file has been modified and it is closed, Venus updates the servers by sending the new file. Other operations which modify the file system, such as making directories, removing files or directories and creating or removing (symbolic) links, are propagated to the servers also.

So we see that Coda caches all the information it needs on the client, and only informs the server of updates made to the file system. Studies have confirmed that modifications are quite rare compared to read-only access to files, hence we have gone a long way towards eliminating client-server communication. These mechanisms to aggressively cache data were implemented in AFS and DFS, but most other systems have more rudimentary caching. We will see later how Coda keeps files consistent, but first pursue what else one needs to support disconnected operation.

Figure 4. Hoarded Files are sticky in the cache. (Illustration by Gaich Muramatsu)

From Caching to Disconnected Operation

The origin of disconnected operation in Coda lies in one of the original research aims of the project: to provide a file system with resilience to network failures. AFS, which supported thousands of clients in the late 80s on the CMU campus, had become so large that network outages and server failures occurred somewhere almost every day. This was a nuisance. Coda also turned out to be a well-timed effort because of the rapid advent of mobile clients (viz. laptops). Coda's support for failing networks and servers equally applied to mobile clients.

We saw in the previous section that Coda caches all information needed to provide access to the data. When updates to the file system are made, these need to be propagated to the server. In normal connected mode, such updates are propagated synchronously to the server, i.e., when the update is complete on the client it has also been made on the server. If a server is unavailable or if the network connections between client and server fail, such an operation will incur a time-out error and fail. Sometimes, nothing can be done. For example, trying to fetch a file, which is not in the cache, from the servers is impossible without a network connection. In such cases, the error must be reported to the calling program. However, often the time-out can be handled gracefully as follows.

To support disconnected computers or to operate in the presence of network failures, Venus will not report failure(s) to the user when an update incurs a time-out. Instead, Venus realizes that the server(s) in question are unavailable and that the update should be logged on the client. During disconnection, all updates are stored in the CML, the client modification log, which is frequently flushed to disk.
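The caching and logging behaviour described so far can be pictured with a small sketch. It is only an illustration under stated assumptions, not Venus itself: the names FakeServer, CacheManagerSketch, ServerUnreachable and close_modified are hypothetical, the server is a toy in-memory object rather than a real Coda server reached by RPC, and the log is kept in memory rather than flushed to disk.

```python
class ServerUnreachable(Exception):
    """Stands in for an RPC time-out (hypothetical name)."""


class FakeServer:
    """Toy in-memory file server; not the real Coda server protocol."""

    def __init__(self, files):
        self.files = dict(files)
        self.reachable = True

    def fetch(self, path):
        if not self.reachable:
            raise ServerUnreachable(path)
        return self.files[path]

    def store(self, path, data):
        if not self.reachable:
            raise ServerUnreachable(path)
        self.files[path] = data


class CacheManagerSketch:
    """Illustrative client-side cache manager with a modification log."""

    def __init__(self, server):
        self.server = server
        self.cache = {}  # path -> whole-file contents (the cached copies)
        self.cml = []    # client modification log: (operation, path, data)

    def open(self, path):
        # Cache hit: serve the local copy without contacting the server.
        if path in self.cache:
            return self.cache[path]
        # Cache miss: fetch the entire file. A time-out here cannot be
        # hidden, so the exception propagates to the caller.
        data = self.server.fetch(path)
        self.cache[path] = data
        return data

    def close_modified(self, path, data):
        # The local copy is always updated first.
        self.cache[path] = data
        try:
            # Connected mode: the update reaches the server synchronously.
            self.server.store(path, data)
        except ServerUnreachable:
            # Disconnected mode: log the update instead of reporting failure.
            self.cml.append(("store", path, data))
```

In this sketch, a file opened once while the server is reachable stays readable after the server disappears, and a later close_modified call lands in the log instead of raising an error; opening a file that was never cached still fails, as the text says it must.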
The user doesn't notice anything when Coda switches to disconnected mode. Upon reconnection to the servers, Venus will reintegrate the CML: it asks the server to replay the file system updates on the server, thereby bringing the server up to date. Additionally, the CML is optimized: for example, it cancels out if a file is first created and then removed.

There are two other issues of profound importance to disconnected operation. First, there is the concept of hoarding files. Since Venus cannot serve a cache miss during a disconnection, it would be nice if it kept important files in the cache up to date, by frequently asking the server to send the latest updates. Such important files are in the user's hoard database, which can be automatically constructed by spying on the user's file access. Updating the hoarded files is called a hoard walk. In practice, our laptops hoard enormous amounts of system software, such as the X11 Window System binaries and libraries, or Wabi and Microsoft Office. Since a file is a file, legacy applications run just fine.

The second issue is that during reintegration it may appear that during the disconnection another client has modified the file too and has shipped it to the server. This is called a local/global conflict (viz. client/server) which needs repair. Repairs can sometimes be done automatically by application-specific resolvers (which know that one client inserting an appointment into a calendar file for Monday and another client inserting one for Tuesday have not created an irresolvable conflict). Sometimes, but quite infrequently, human intervention is needed to repair the conflict.

On Friday one leaves the office with a good deal of source code hoarded on the laptop. After hacking in one's mountain cabin, the harsh return to the office on Monday (10 days later of course) starts with a re-integration of the updates made during the weekend.
Mobile computing is born.

Volumes, Servers and Server Replication

In most network file systems, the servers enjoy a standard file structure and export a directory to clients. Such a directory of files on the server can be mounted on the client and is called a network share in Windows jargon and a network file system in the UNIX world. For most of these systems it is not practical to mount further distributed volumes inside the already mounted network volumes. Extreme care and thought goes into the server layout of partitions, directories and shares. Coda's (and AFS's) organization differs substantially.

Files on Coda servers are not stored in traditional file systems. Partitions on the Coda server workstations can be made available to the file server. These partitions will contain files which are grouped into volumes. Each volume has a directory structure like a file system, i.e., a root directory for the volume and a tree below it. A volume is on the whole much smaller than a partition, but much larger than a single directory, and is a logical unit of files. For example, a user's home directory would normally be a single Coda volume, and similarly the Coda sources would reside in a single volume. Typically a single server would have some hundreds of volumes, perhaps with an average size of approximately 10MB. A volume is a manageable amount of file data which is a very natural unit from the perspective of system administration and has proven to be quite flexible.

Coda holds volume and directory information, access control lists and file attribute information in raw partitions. These are accessed through a log-based recoverable virtual memory package (RVM) for speed and consistency. Only file data resides in the files in server partitions. RVM has built-in support for transactions; this means that in case of a server crash, the system can be restored to a consistent state without much effort.

Figure 5. Failure Resilience Methods

Figure 6. AVSG vs. VSG (Illustration by Gaich Muramatsu)

A volume has a name and an ID, and it is possible to mount a volume anywhere under /coda.
For example, a single command mounts the volume u.braam on /coda/usr/braam. Coda does not allow mount points to be existing directories; instead, it will create a new directory as part of the mount process. This eliminates the confusion that can arise in mounting UNIX file systems on top of existing directories. While it seems quite similar to the Macintosh and Windows traditions of creating "network drives and volumes", the crucial difference is that the mount point is invisible to the client: it appears as an ordinary directory under /coda. A single volume enjoys the privilege of being the root volume; it is the volume which is mounted on /coda at startup time.

Coda identifies a file by a triple of 32-bit integers called a Fid: it consists of a VolumeId, a VnodeId and a Uniquifier. The VolumeId identifies the volume in which the file resides. The VnodeId is the inode number of the file, and the uniquifiers are needed for resolution. The Fid is unique in a cluster of Coda servers.

Coda has read/write replication servers, i.e., a group of servers can hand out file data to clients, and generally updates are made to all servers in this group. The advantage of this is higher availability of data: if one server fails, others take over without a client noticing the failure. Volumes can be stored on a group of servers called the VSG (Volume Storage Group).

For replicated volumes, the VolumeId is a replicated VolumeId.
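As a brief aside on the Fid described above, the triple can be pictured as a small record of three unsigned 32-bit integers. The sketch below is illustrative only: the field names are adapted from the text, while the big-endian byte order, the pack and unpack helpers, and any example values are assumptions made here, not Coda's actual wire or disk format.

```python
import struct
from typing import NamedTuple


class Fid(NamedTuple):
    """A (VolumeId, VnodeId, Uniquifier) triple of 32-bit integers."""

    volume_id: int   # identifies the volume in which the file resides
    vnode_id: int    # inode number of the file within that volume
    uniquifier: int  # extra value the text says is needed for resolution

    def pack(self) -> bytes:
        # Three unsigned 32-bit integers, 12 bytes in total.
        # Big-endian order is an assumption of this sketch.
        return struct.pack(">III", self.volume_id, self.vnode_id, self.uniquifier)

    @classmethod
    def unpack(cls, raw: bytes) -> "Fid":
        # Inverse of pack(); raises struct.error if raw is not 12 bytes long.
        return cls(*struct.unpack(">III", raw))
```

Because the record is a tuple, two Fids compare equal exactly when all three components match, which is the property that makes the triple usable as an identifier within a cluster.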
The replicated volume ID brings together a Volume Storage Group and a local volume on each of the members.

* The VSG is a list of servers which hold a copy of the replicated volume.
* The local volume for each server defines a partition and local volume ID holding the files and meta-data on that server.

When Venus wishes to access an object on the servers, it first needs to find the VolumeInfo for the volume containing the file. This information contains the list of servers and the local volume IDs on each server by which the volume is known. For files, the communication with the servers in a VSG is read-one, write-many; that is, read the file from a single server in the VSG and propagate updates to all of the available VSG members, the AVSG. Coda can employ multicast RPCs, and hence the write-many updates are not a severe performance penalty.

The overhead of first having to fetch volume information is deceptive too. While there is a one-time lookup for volume information, subsequent file access enjoys much shorter path traversals, since the root of the volume is much nearer than is common in mounting large directories.

Server replication, like disconnected operation, has two cousins who need introduction: resolution and repair. Some servers in the VSG can become partitioned from others through network or server failures. In this case, the AVSG for certain objects will be strictly smaller than the VSG. Updates cannot be propagated to all servers, but only to the members of the AVSG, thereby introducing global (viz. server/server) conflicts.

Before fetching an object or its attributes, Venus will request the version stamps from all available servers. If it detects that some servers do not have the latest copy of files, it initiates a resolution process which tries to automatically resolve the differences. If this fails, a user must repair manually. The resolution, though initiated by the client, is handled entirely by the servers.

Replication servers and resolution are marvelous. We have suffered disk failures from time to time in some of our servers. To repair the server, all that needs to be done is to put in a new drive and tell Coda: resolve it.
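The read-one, write-many discipline and the version-stamp check described above can be sketched in a few lines. This is a toy model under loud assumptions: Replica, avsg, write_many and read_one are hypothetical names, a single integer counter stands in for the version stamps, reading from the replica with the highest counter is a simplification chosen here, and the sketch only detects that resolution is needed rather than performing it.

```python
class Replica:
    """Toy stand-in for one server in a volume storage group (VSG)."""

    def __init__(self, name):
        self.name = name
        self.reachable = True
        self.objects = {}  # path -> (version, data)

    def version(self, path):
        # Version 0 means "this replica has never seen the object".
        return self.objects.get(path, (0, None))[0]


def avsg(vsg):
    """Return the available members of the VSG."""
    return [r for r in vsg if r.reachable]


def write_many(vsg, path, data):
    """Propagate an update to every available replica; return how many."""
    members = avsg(vsg)
    if not members:
        raise RuntimeError("no replica reachable")
    latest = max(r.version(path) for r in members)
    for r in members:
        r.objects[path] = (latest + 1, data)
    return len(members)


def read_one(vsg, path):
    """Compare version stamps across the AVSG, then read from one replica.

    Returns (data, needs_resolution).
    """
    members = avsg(vsg)
    if not members:
        raise RuntimeError("no replica reachable")
    stamps = {r.version(path) for r in members}
    needs_resolution = len(stamps) > 1
    newest = max(members, key=lambda r: r.version(path))
    return newest.objects[path][1], needs_resolution
```

If one replica is unreachable during a write and later returns, a subsequent read_one sees differing stamps and reports that resolution is needed, which mirrors the partitioned-server case in the text.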
The resolution system brings the new disk up to date with respect to other servers.

Coda in Action

Coda is in constant active use at CMU. Several dozen clients use it for development work (of Coda), as a general purpose file system and for specific disconnected applications. The following two scenarios have exploited the features of Coda very successfully. We have purchased a number of licenses for Wabi and Windows software. Wabi allows people to run MS PowerPoint. We have stored Wabi and Windows 3.1 including MS Office in Coda and it is shared by our clients. Of course, .ini files with preferences are particular to a given user, but most libraries and applications are not. Through hoarding we continue to use the software on disconnected laptop computers for presentations. This is frequently done at conferences.

Over the years of its use we have not lost user data. Sometimes disks in our servers have failed, but since all of our volumes are replicated, we replaced the disk with an empty one and asked the resolution mechanism to update the repaired server. All one needs to do for this is to type ls -lR in the affected file tree when the new disk is in place. The absence of the file on the repaired server will be noticed, and resolution will transport the files from the good servers to the newly repaired one.

There are a number of compelling future applications where Coda could provide significant benefits.

* FTP mirror sites should be Coda clients. As an example let's take ftp.redhat.com, which has many mirrors. Each mirror activates a Perl script, which walks the entire tree at Red Hat to see what has been updated and fetches it, regardless of whether it is needed at the mirror. Contrast this with Red Hat storing their ftp area in Coda. Mirror sites should all become Coda clients too, but only Red Hat would have write permission. When Red Hat updates a package, the Coda servers notify the mirror sites that the file has changed. The mirror sites will fetch this package, but only the next time someone tries to fetch this package.
* WWW replication servers should be Coda clients. Many ISPs are struggling with a few WWW replication servers. They have too much access to use just a single http server.
Using NFS to share the documents to be served has proven problematic due to performance problems, so manual copying of files to the individual servers is frequently done. Coda could come to the rescue since each server could be a Coda client and hold the data in its cache. This provides access at local disk speeds. Combine this with clients of the ISP who update their web information off-line and we have a good application for mobile clients too.
* Network computers could exploit Coda as a cache to dramatically improve performance. Updates to the network computer would automatically be made as they become available on servers, and for the most part the computer would operate without network traffic, even after restarts.

Our current efforts are mostly to improve the quality of Coda. The rough edges, which inevitably come with research systems, are slowly being smoothed out. Write-back caching will be added in order for Coda to operate much faster. The disconnected operation is an extreme form of write-back caching, and we are leveraging these mechanisms for write-back caching during connected operation. Kerberos support is being added. The networking protocols supporting Coda are making this easily possible. We would like to have cells which will allow clients to connect to more than a single Coda cluster simultaneously. Further ports will hopefully allow many systems to use Coda.

Getting Coda

Coda is available by FTP from ftp.coda.cs.cmu.edu. You will find RPM packages for Linux as well as tar files of the source. Kernel support for Coda will come with the Linux 2.2 kernels. On the WWW site http://www.coda.cs.cmu.edu/, you will find additional resources such as mailing lists, manuals and research papers.

Peter adores his wife Anne, and together they love Alaska with its mountains, wildlife and a halfway acceptable population density. Nothing is better than having a moose on their porch there, or camping on a not too scary glacier. Until March 1997 Peter was a faculty member in the Mathematical Institute at Oxford. In the summer of 1995 Peter became president of Stelias Computing Inc. which assembled the InfoMagic Workgroup Server.
Dabblings in Mach and the GNU Hurd evolved into porting Coda to Linux. E-mails about this with Satya, the visionary leader of the Coda and Odyssey projects, led to a visit to Carnegie Mellon University in late 1996 and eventually to him joining the Computer Science faculty. He is now leading the Coda project. He can be reached at [email protected].